Message Passing and Combinatorial Optimization 

by 

Siamak Ravanbakhsh 

A thesis submitted in partial fulfillment of the requirements for the degree of 

Doctor of Philosophy 

Department of Computing Science 
University of Alberta 


@ Siamak Ravanbakhsh, 2015 




Abstract 


Graphical models use the intuitive and well-studied methods of graph theory to implicitly represent 
dependencies between variables in large systems. They can model the global behaviour of a complex 
system by specifying only local factors.This thesis studies inference in discrete graphical models from 
an “algebraic perspective” and the ways inference can be used to express and approximate NP-hard 
combinatorial problems. 

We investigate the complexity and reducibility of various inference problems, in part by organizing 
them in an inference hierarchy. We then investigate tractable approximations for a subset of these 
problems using distributive law in the form of message passing. The quality of the resulting message 
passing procedure, called Belief Propagation (BP), depends on the influence of loops in the graphical 
model. We contribute to three classes of approximations that improve BP for loopy graphs (I) loop 
correction techniques; (II) survey propagation, another message passing technique that surpasses BP 
in some settings; and (III) hybrid methods that interpolate between deterministic message passing and 
Markov Chain Monte Carlo inference. 

We then review the existing message passing solutions and provide novel graphical models and infer¬ 
ence techniques for combinatorial problems under three broad classes: (I) constraint satisfaction prob¬ 
lems (CSPs) such as satisfiability, coloring, packing, set / clique-cover and dominating / independent set 
and their optimization counterparts; (II) clustering problems such as hierarchical clustering, K-median, 
K-clustering, K-center and modularity optimization; (III) problems over permutations including (bot¬ 
tleneck) assignment, graph “morphisms” and alignment, finding symmetries and (bottleneck) traveling 
salesman problem. In many cases we show that message passing is able to find solutions that are either 
near optimal or favourably compare with today’s state-of-the-art approaches. 
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Introduction 

J\^any complicated systems can be modeled as a graphieaj structure with interacting local func- 
tions. Many fields have (almost independently) discovered this: graphical models have been used 
in bioinformatics (protein folding, medical imaging and spectroscopy, pedagogy trees, regulatory net¬ 
works [27, 178, 224, 255, 323]), neuroscience (formation of associative memory and neuroplastic¬ 
ity [10, 65]), communication theory (low density parity check codes [106, 290]), statistical physics 
(physics of dense matter and spin-glass theory [211]), image processing (inpainting, stereo/texture re¬ 
construction, denoising and super-resolution [98, 101]), compressed sensing [84], robotics [294] (particle 
filters), sensor networks [76, 143], social networks [203, 307], natural language processing [200], speech 
recognition [73, 131], artihcial intelligence (artificial neural networks, Bayesian networks [244, 316]) and 
combinatorial optimization. This thesis is concerned with the application of graphical models in 
solving combinatorial optimization problems [122, 222, 287], which broadly put, seeks an “optimal” 
assignment to a discrete set of variables, where a brute foree approach is infeasible. 

To see how the decomposition offered by a graphical model can model a complex system, consider 
a joint distribution over 200 binary variables. A naive way to represent this would require a table with 
2200 ej^tries. However if variables are conditionally independent such that their dependence structure 
forms a tree, we can exactly represent the joint distribution using only 200 x 2^ values. Operations such 
as marginalization, which require computation time linear in the original size, are now reduced to local 
computation in the form of message passing on this structure (i.e., tree), which in this case, reduces 
the cost to linear in the new exponentially smaller size. It turns out even if the dependency structure 
has loops, we can use message passing to perform “approximate” inference. 

Moreover, we approach the problem of inference from an algebraic point of view [7]. This is in 
contrast to the variational perspective on local computation [303]. These two perspectives are to some 
extent “residuals” from the different origins of research in AI and statistical physics. 

In the statistical study of physical systems, the Boltzmann distribution relates the probability of each 
state of a physical system to its energy, which is often decomposed due to local interactions [208, 211]. 
These studies have been often interested in modeling systems at the thermodynamic limit of infinite 
variables and the average behaviour through the study of random ensembles. Inference techniques with 
this origin {e.g., mean-field and cavity methods) are often asymptotically exact under these assumptions. 


2 


Most importantly these studies have reduced inference to optimization through the notion of free energy 
-a.k.a. variational approach. 

In contrast, graphical models in the AI community have emerged in the study of knowledge repre¬ 
sentation and reasoning under uncertainty [245]. These advances are characterized by their attention 
to the theory of computation and logic [17], where interest in computational (as opposed to analytical) 
solutions has motivated the study of approximability, computational complexity [74, 267] and invention 
of inference techniques such as belief propagation that are efficient and exact on tree structures. Also, 
these studies have lead to algebraic abstractions in modeling systems that allow local computation 
[185, 279]. 

The common foundation underlying these two approaches is information theory, where derivation of 
probabilistic principles from logical axioms [146] leads to notions such as entropy and divergences that 
are closely linked to their physical counter-parts f.e., entropy and free energies in physical systems. At 
a less abstract level, it was shown that inference techniques in AI and communication are attempting 
to minimize (approximations to) free energy [6, -326]. 

Another exchange of ideas between the two fields was in the study of critical phenomenon in random 
constraint satisfaction problems by both computer scientists and physicists [103, 215, 216]; satisfiability 
is at the heart of theory of computation and an important topic to investigate reasoning in AI. On 
the other hand, the study of critical phenomena and phase transitions is central in statistical physics 
of disordered systems. This was culminated when a variational analysis lead to discovery of survey 
propagation [213] for constraint satisfaction, which significantly advanced the state-of-the-art in solving 
random satisfiability problems. 

Despite this convergence, variational and algebraic perspectives are to some extent complementary - 
e.g., the variational approach does not extend beyond (log) probabilities, while the algebraic approach 
cannot justify application of message passing to graphs with loops. Although we briefly review the 
variational perspective, this thesis is mostly concerned with the algebraic perspective. In particular, 
rather than the study of phase transitions and the behavionr of the set of solutions for combinatorial 
problems, we are concerned with finding solutions to individual instances. 

Part I starts by expressing the general form of inference, proposes a novel inference hierarchy and 
studies its complexity in chapter 1. Here, we also show how some of these problems are reducible to 
others and introduce the algebraic structures that make efficient inference possible. The general form 
of notation and the reductions that are proposed in this chapter are used in later chapters. 

Chapter 2 studies some forms of approximate inference, by first introducing belief propagation. It 
then considers the problems with intractably large number of factors and factors with large cardinality, 
then proposes/reviews solutions to both problems. We then study different modes of inference as 
optimization and review alternatives such as convergent procedures and convex and linear programming 
relaxations for some inference classes in the inference hierarchy. Standard message passing using belief 
propagation is only gnaranteed to be exact if the graphical structure has no loops. This optimization 
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perspective (a.k.a. variational perspective) has also led to design of approximate inference techniques 
that account for short loops in the graph. A different family of loop correction techniques can account 
for long loops by taking message dependencies into account. This chapter reviews these methods and 
introduces a novel loop correction scheme that can account for both short and long loops, resulting in 
more accurate inference over difficult instances. 

Message passing over loopy graphs can be seen as a fixed point iteration procedure, and the existence 
of loops means there may be more than one fixed point. Therefore an alternative to loop correction is to 
in some way incorporate all fixed points. This can be performed also by a message passing procedure, 
known as survey propagation. The next section of this chapter introduces survey propagation from 
a novel algebraic perspective that enables performing inference on the set of fixed points. Another 
major approach to inference is offered by Markov Chain Monte Carlo (MCMC) techniques. After a 
minimal review of MCMC, the final section of this chapter introduces a hybrid inference procedure, 
called perturbed belief propagation that interpolates between belief propagation and Gibbs sampling. 
We show that this technique can outperform both belief propagation and Gibbs sampling in particular 
settings. 

Part II of this thesis uses the inference techniques derived in the first part to solve a wide range 
of combinatorial problems. We review the existing message passing solutions and provide novel formu¬ 
lations for three broad classes of problems: 1) constraint satisfaction problems (CSPs), 2) clustering 
problems and 3) combinatorial problems over permutations. 

In particular, in chapter 3 we use perturbed belief propagation and perturbed survey propagation to 
obtain state-of-the-art performance in random satisfiability and coloring problems. We also introduce 
novel message passing solutions and review the existing methods for sphere packing, set-cover, clique- 
cover, dominating-set and independent-set and several of their optimization counterparts. By applying 
perturbed belief propagation to graphical representation of packing problem, we are able to compute 
long “optimal” nonlinear binary codes with large number of digits. 

Chapter 4 proposes message passing solutions to several clustering problems such as K-clustering, 
K-center and Modularity optimization and shows that message passing is able to find near-optimal 
solutions on moderate instances of these problems. Here, we also review the previous approaches to 
K-median and hierarchical clustering and also the related graphical models for minimum spanning tree 
and prize-collecting Steiner tree. 

Chapter 5 deals with combinatorial problems over permutations, by first reviewing the existing 
graphical models for matching, approximation of permanent, and graph alignment and introducing 
two novel message passing solutions for min-sum and min-max versions of traveling salesman problem 
(a.k.a. bottleneck TSP). We then study graph matching problems, including (sub-)graph isomorphism, 
monomorphism, homomorphism, graph alignment and “approximate” symmetries. In particular, in 
the study of graph homomorphism we show that its graphical model generalizes that of of several 
other problems, including Hamiltonian cycle, clique problem and coloring. We further show how graph 
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homomorphism can be used as a surrogate for isomorphism to find symmetries. 


Contributions and acknowledgment 

The results in this thesis are a joint work with my supervisor Dr. Greiner and other researchers. In 
detail, the algebraic approach to inference is presented in [256]; the loop correction ideas are published 
in [254]; perturbation schemes for CSP are presented in [253]; performing min-max inference was first 
suggested by Dr. Brendan Frey and Christopher Srinivasa, and many of the related ideas including 
min-max reductions are presented in our joint paper [258]. Finally, the augmentation scheme for TSP 
and Modularity maximization is discussed in [257]. 

The contribution of this thesis, including all the published work is as follows: 

- Generalization of inference problems in graphical models including: 

• The inference hierarchy. 

• The limit of distributive law on tree structures. 

• All the theorems, propositions and claims on complexity of inference, including 

o NP-hardness of inference in general commutative semirings. 

- A unified treatment of different modes of inference over factor-graphs and identification of their 
key properties (e.g., significance of inverse operator) in several settings including: 

• Loop correction schemes. 

• Survey propagation equations. 

- Reduction of min-max inference to min-sum and sum-product inference. 

- Simplified form of loop correction in Markov networks and their generalization to incorporate short 
loops over regions. 

~ A novel algebraic perspective on survey propagation. 

- Perturbed BP and perturbed SP and their application to constraint satisfaction problems. 

- Factor-graph augmentation for inference over intractably large number of constraints. 

- Factor-graph formulation for several combinatorial problems including 

• Glique-cover. 

• Independent-set, set-cover and vertex cover. 

• Dominating-set and packing (the binary-variable model) 

• Packing with hamming distances 

• K-center problem, K-clustering and clique model for modularity optimization. 

• TSP and bottleneck TSP. 

- The general framework for study of graph matching, including 

• Subgraph isomorphism.^ 

^Although some previous work [44] claim to address the same problem, we note that their formulation is for sub-graph 
monomorphism rather than isomorphism. 




Study of message passing for Homomorphism and finding approximate symmetries. 
Graph alignment with a diverse set of penalties. 



Part I 


Inference by message passing 
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This part of the thesis first studies the representation formalism, hierarchy of inference problems, 
reducibilities and the underlying algebraic structure that allows efficient inference in the form of mes¬ 
sage passing in graphical models, in chapter 1. By viewing inference under different lights, we then 
review/introduce procedures that allow better approximations in chapter 2. 



Chapter 1 

Representation, complexity and 
redncibility 


In this chapter, we use a simple algebraic structure - f.e., commutative semigroup - to express a general 
form for inference in graphical models. To this end, we first introduce the factor-graph representation 
and formalize inference in section 1.1. Section 1.2 focuses on four operations defined by summation, 
multiplication, minimization and maximization, to construct a hierarchy of inference problems within 
PSP ACE, such that the problems in the same class of the hierarchy belong to the same complexity 
class. Here, we encounter some new inference problems and establish the completeness of problems at 
lower levels of hierarchy w.r.t. their complexity classes. In section 1.3 we augment our simple structures 
with two properties to obtain message passing on commutative semirings. Here, we also observe that 
replacing a semigroup with an Abelian group, gives us normalized marginalization as a form of inference 
inquiry. Here, we show that inference in any commutative semiring is NP-hard and postpone further 
investigation of message passing to the next chapter. Section 1.4 shows how some of the inference 
problems introduced so far are reducible to others. 

1.1 The problem of inference 

We use commutative semigroups to both define what a graphical model represents and also to define 
inference over this graphical model. The idea of using structures such as semigroups, monoids and 
semirings in expressing inference has a long history[33, 185, 275]. Our approach, based on factor- 
graphs [181] and commutative semigroups, generalizes a variety of previous frameworks, including 
Markov networks [68], Bayesian networks [243], Forney graphs [100], hybrid models [83], influence 
diagrams [137] and valuation networks [278]. 

In particular, the combination of factor-graphs and semigroups that we consider here generalizes 
the plausibility, feasibility and utility framework of Pralet et al. [250], which is explicitly reduced to the 
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graphical models mentioned above and many more. The main difference in our approach is in keeping 
the framework free of semantics (e.g., decision and chance variables, utilities, constraints), that are 
often associated with variables, factors and operations, without changing the expressive power. These 
notions can later be associated with individual inference problems to help with interpretation. 

Definition 1.1.1. A commutative semigroup is a pair ^ where y* is a set and ® : 

y* xy* ^ y* is a binary operation that is (I) associative: a0(6(8)c) = {a®b)®c and (II) commutative: 
a®b = b®a for all a,b,c € y*. A commutative monoid is a commutative semigroup plus an identity 

element 1 such that a® 1 = a- If every element a G T* has an inverse a ^ (often written -), such that 

_ ® ® 

a® a ^ = 1 , and a ® 1 = a, the commutative monoid is an Abelian group. 

Here, the associativity and commutativity properties of a commutative semigroup make the opera¬ 
tions invariant to the order of elements. In general, these properties are not “vital” and one may define 
inference starting from a magma. 

Example 1.1.1. Some examples of semigroups are: 

~ The set of strings with the concatenation operation forms a semigroup with the empty string as 
the identity element. However this semigroup is not commutative. 

- The set of natural numbers M with summation defines a commutative semigroup. 

- Integers modulo n with addition defines an Abelian group. 

- The power-set 2“^ of any set S, with intersection operation defines a commutative semigroup with 
S as its identity element. 

- The set of natural numbers with greatest common divisor defines a commutative monoid with 0 as 
its identity. In fact any semilattice is a commutative semigroup [79]. 

- Given two commutative semigroups on two sets y* and 2*, their Cartesian product is also a 
commutative semigroup. 

Let X = {xi,... ,xn) be a tuple of N discrete variables Xi G A), where A) is the domain of Xi 
and X G A" = A"! x ... x AV- Let I C A/" = {1,2,..., A} denote a subset of variable indices and 
Xj = {xj I z G 1} G Ai be the tuple of variables in x indexed by the subset 1. A factor fj : Aj —)> Ti is a 
function over a subset of variables and Ti = {^ 1 (^ 1 ) I £ '^ 1 } is the range of this factor. 

Definition 1.1.2. A factor-graph is a pair (A, i^) such that 

• T = {fi} is a collection of factors with collective range y = Ui-^i- 

. |A| =Poly(A). 

magma [247] generalizes a semigroup, as it does not require associativity property nor an identity element. Inference 
in graphical models can be also extended to use magma (in definition 1.1.2). For this, the elements of T* and/or X should 
be ordered and/or parenthesized so as to avoid ambiguity in the order of pairwise operations over the set. Here, to avoid 
unnecessary complications, we confine our treatment to commutative semigroups. 
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• fi has a polynomial representation in N and it is possible to evaluate fi(xj) VI, Xj in polynomial 
time. 

• = {y*,0) is a commutative semigroup, where 3^* is the closure of y w.r.t. (8). 

The factor-graph compactly represents the expanded (joint) form 

q(^) = (^fi(s) (1-1) 

I 

Note that the connection between the set of factors J- and the commutative semigroup is through 
the “range” of factors. The conditions of this dehnition are necessary and sufficient to 1) compactly 
represent a factor-graph and 2) evaluate the expanded form, q(x), in polynomial time. A stronger 
condition to ensure that a factor has a compact representation is |Ai| = Poly(A^), which means fi(xj) 
can be explicitly expressed for each Xj G Aj as an |I|-dimensional array. 

T can be conveniently represented as a bipartite graph that includes two sets of nodes: variable 
nodes x*, and factor nodes I. A variable node i (note that we will often identify a variable Xj with its 
index “i”) is connected to a factor node I if and only if i G I -i.e., I is a set that is also an index. We will 
use d to denote the neighbours of a variable or factor node in the factor graph - that is 51 = {i | i G 1} 

(which is the set I) and di = {1 \ i G I}. Also, we use Ai to denote the Markov blanket of node x* - 
he., Ai = {j G 51 | I G di, j ^ i}. 

Example 1.1.2. Figure 1.1 shows a factor-graph with 12 variables and 12 factors. Here x = (xj, Xj,Xk,Xe, Xm, Xq, x, 
I = 51 = {i,j,k}, Xk = X{k,w,v} — {IjVjW}. Assuming = (M, min), the expanded form rep¬ 

resents 

q(x) = min{fi(xi),fj(xj),... ,fz(xz)}. 

Now, assume that all variables are binary - he., X = {0,1}^^ and q(x) is 12-dimensional hyper¬ 
cube, with one assignment at each corner. Also assume all the factors count the number of non¬ 
zero variables - e.g., for = (1,0,1) G A\v we have fw(:2w) = 2. Then, for the complete as¬ 
signment z = (0,1, 0,1, 0,1, 0,1,0,1,0,1) G X, it is easy to check that the expanded form is q(z) = 
min{2,0,1,..., 1} = 0. 

A marginalization operation shrinks the expanded form q(x) using another commutative semigroup 
with binary operation ©. Inference is a combination of an expansion and one or more marginalization 
operations, which can be computationally intractable due to the exponential size of the expanded form. 

Definition 1.1.3. Given a function q : Aj —)• T, and a commutative semigroup = (T*,©), where y* 
is the closure of y w.r.t. ©, the marginal of q for I C J is 

q(^j\i) 0q(^j) 


( 1 . 2 ) 
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Figure 1.1: A factor-graph with variables as circles and factors as squares. 


where 02 ;jq(xj) is short for q(xj\j, Xj), and it means to compute q(xj\j) for each Xj^j, one 

should perform the operation © over the set of all the assignments to the tuple Xj G Afi. 

We can think of q(xj) as a | J|-dimensional tensor and marginalization as performing © operation 
over the axes in the set I. The result is another | J \ 11-dimensional tensor (or function) that we call the 
marginal. Here if the marginalization is over all the dimensions in J, we denote the marginal by q(0) 
instead of q(x 0 ) and call it the integral of q. 

Now we define an inference problem as a sequence of marginalizations over the expanded form of a 
factor-graph. 

Definition 1.1.4. An inference problem seeks 

M M-l 1 


0^ •••0^ 0flfe) 


where 


1 M 

y* is the closure of y (the collective range of factors), w.r.t. ©,...,© and ©. 

m 

= (T*,©) VI < m < M and = (T*,©) are all commutative semigroups. 
Jo, •.., Jl partition the set of variable indices AA = {1 ,..., A^}. 
q(xjjj) has a polynomial representation in A^ - i.e., iTj^j = Poly(A^) 
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1 M 

Note that refer to potentially different operations as each belongs to a different semigroup. 

When Jo = 0, we call the inference problem integration (denoting the inquiry by q(0)) and otherwise 
we call it marginalization. Here, having a constant sized Jq is not always enough to ensure that q(Tjp) 
has a polynomial representation in N. This is because the size of q(Tjjj) for any individual G Xjg 
may grow exponentially with N {e.g., see Claim 1.2.1). In the following we call = (T*,©) the 

m 

expansion semigroup and = (T*,©) yi < m < M the marginalization semigroup. 

Example 1.1.3. Going back to example 1.1.3, the shaded region in figure 1.1 shows a partitioning of 
the variables that we use to define the following inference problem: 

q(lijo) = max ^ min min fi (xj) 

-Ja X, -Ji ^ 

-J2 

We can associate this problem with the following semantics: we may think of each factor as an agent, 
where fi(xj) is the payoff for agent I, which only depends on a subset of variables Xj. We have adversarial 
variables (xj^), environmental or chance variables (xj^), controlled variables (xj^) and query variables 
(xjg). The inference problem above for each query xseeks to maximize the expected minimum payoff 
of all agents, without observing the adversarial or chance variables, and assuming the adversary makes 
its decision after observing control and chance variables. 

Example 1.1.4. A “probabilistic” graphical model is defined using a expansion semigroup = 
x) and often a marginalization semigroup = (M-^, +). The expanded form represents the un¬ 
normalized joint probability q(x) = nifi(j'i)) whose marginal probabilities are simply called marginals. 
Replacing the summation with marginalization semigroup = (M-*^,max), seeks the maximum prob¬ 
ability state and the resulting integration problem q(0) = max^, ](([j fi(xj) is known as maximum a 
posteriori (MAP) inference. Alternatively by adding a second marginalization operation to the 
summation, we get the marginal MAP inference 

q(Tjo) = max 


1 2 

where here 0 = 0; 0 = X] and 0 = max. 

If the object of interest is the negative log-probability (a.k.a. energy), the product expansion semi¬ 
group is replaced by +)• Instead of sum marginalization semigroup, we can use the log-sum- 

exp semigroup, = (1^, +) where a© 6 log(e““ + e“^). The integral in this case is the log-partition 
function. If we change the marginalization semigroup to = (M, min), the integral is the minimum 
energy (corresponding to MAP). 

A well-known example of a probabilistic graphical model is the Ising model of ferromagnetism. 
This model is an extensively studied in physics, mainly to model the phase transition in magnets. The 
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model consists of binary variables (xi G { — 1 , 1 }) - denoting magnet spins - arranged on the nodes of a 
graph Q = (V,<S) (usually a grid or Cayley tree). The energy function (i.e., Hamiltonian) associated 
with a configuration x is the joint form 

e{x) = q(x) = - ^ XiJijXj-'^hi (1.5) 

(i,j)&£ *SV 

Variable interactions are denoted by J and h is called the local field. Here each Jij defines a factor 
over Xi,Xj\ = ~Xi Jij xj and local fields define local factors f[i}{xi) = —hiXi. 

Depending on the type of interactions, we call the resulting Ising model: 

• ferromagnetic, if all Jij > 0. In this setting, neighbouring variables are likely to take similar 
values. 

• anti-ferromagnetic, if all Jij < 0 . 

• non-ferromagnetic, if both kind of interactions are allowed. In particular, if the ferromagnetic and 
anti-ferromagnetic interactions have comparable frequency, the model is called spin-glass. This class 
of problem shows most interesting behaviours, which is not completely understood [ 212 ]. As we will see, 
the studied phenomena in these materials have important connections to difficult inference problems 
including combinatorial optimization problems. Two well studied models of spin glass are Edward- 
Anderson (EA [93]) and Sherrington-Kirkpatrick (SK [280]) models. While the EA model is defined on 
a grid (i.e., spin-glass interactions over a grid), the SK model is a complete graph. 

1.2 The inference hierarchy 

Often, the complexity class is concerned with the decision version of the inference problem in defi¬ 
nition 1.1.4. The decision version of an inference problem asks a yes/no question about the integral: 

7 

q(0) > q for a given q. 

Here, we produce a hierarchy of inference problems in analogy to polynomial [288], the counting [302] 
and arithmetic [264] hierarchies. 

To define the hierarchy, we assume the following in definition 1.1.4: 

i «-i-i 

- Any two consecutive marginalization operations are distinct (© 7 ^ © VI < Z < M). 

- The marginalization index sets J; VI < Z < M are non-empty. Moreover if jj;j = 0(log(V)) we 
call this marginalization operation a polynomial marginalization as here jAjJ = Poly(iV). 

- In defining the factor-graph, we required each factor to be polynomially computable. In building 
the hierarchy, we require the operations over each semigroup to be polynomially computable as well. 
To this end we consider the set of rational numbers y* C Q-^U{±oo}. Note that this automatically 
eliminates semigroups that involve operations such as exponentiation and logarithm (because Q 
is not closed under these operations) and only consider summation, product, minimization and 
maximization. 
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We can always re-express any inference problem to enforce the first two conditions and therefore 
they do not impose any restriction. In the following we will use the a language to identify inference 
problems for an arbitrary set of factors J- = {fi}. For example, sum-product refers to the inference 

7 

problem Xla; Hi fi( 21 i) ^ 9 - this sense the rightmost “token” in the language (here product) identifies 
the expansion semigroup = (Q,n) and the rest of tokens identify the marginalization semigroups 
over Q in the given order. Therefore, this minimal language exactly identihes the inference problem. The 
only information that affects the computational complexity of an inference problem but is not specified 
in this language is whether each of the marginalization operations are polynomial or exponential. 

We define five inference families: S, 11, T, A. The families are associated with that “outermost” 

M 

marginalization operation - i.e., © in definition 1.1.4). S is the family of inference problems where 

M 

© = sum. Similarly, 11 is associated with product, with minimization and T with maximization. A 

is the family of inference problems where the last marginalization is polynomial (i.e., | Jm| = d(log(A)) 

M 

regardless of ©). 

Now we define inference classes in each family, such that all the problems in the same class have 
the same computational complexity. Here, the hierarchy is exhaustive - i.e., it includes all inference 
problems with four operations sum, min, max and product whenever the integral q( 0 ) has a polynomial 
representation (see Claim 1.2.1). Moreover the inference classes are disjoint. For this, each family is 
parameterized by a subscript M and two sets S and T> {e.g., is an inference “class” in family 

<I>). As before, M is the number of marginalization operations, S is the set of indices of the (exponential) 
sum-marginalization and T) is the set of indices of polynomial marginalizations. 

Example 1.2.1. Sum-min-sum-product identifies the decision problem 

5J3 I 

where Ji, J 2 and J 3 partition M. Assume Ji = {2,...,y}, J 2 = {y + l,...,iV} and J 3 = {1}. 

Since we have three marginalization operations M = 3. Here the first and second marginalizations are 

exponential and the third one is polynomial (since IJ 3 I is constant). Therefore T> = {3}. Since the only 

1 

exponential summation is 0,^, = , 5 = {1}. In our inference hierarchy, this problem belongs to 

the class A 3 ({ 1 },{ 3 }). 

Alternatively, if we use different values for Ji, J 2 and J 3 that all linearly grow with N, the corre¬ 
sponding inference problem becomes a member of S 3 ({ 1 , 3 }, 0 ). 

Remark 1. Note that arbitrary assignments to M, S and D do not necessarily define a valid inference 
class. For example we require that 5 H H = 0 and no index in D and S are is larger than M. Moreover, 
the values in S and D should be compatible with the inference class. For example, for inference class 
Tim{S,T>), M is a member of S. For notational convenience, if an inference class notation is invalid we 
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equate it with an empty set - e.g., 'I'i({l},0) = 0, because S = {1} and M = 1 means the inference 
class is S rather than 

In the definition below, we ignore the inference problems in which product appears in any of the 
marginalization semigroups (e.g., product-sum). The following claim, explains this choice. 

Claim 1.2.1. For Q)m = prod, the inference query q(xjg) can have an exponential representation in 

N. 

Proof. The claim states that when the product appears in the marginalization operations, the marginal 
(and integral) can become very large, such that we can no longer represent them in polynomial space in 
N. We show this for an integration problem. The same idea can show the exponential representation 
of a marginal query. 

To see why this integral has an exponential representation in N, consider its simplified form 

q( 0 )= 

Xj 


where q(x) here is the result of inference up to the last marginalization step ©, which is product, where 
Tj grows exponentially with N. Recall that the hierarchy is defined for operations on Since q(a:j) 
for each Xj G Tj has a constant size, say c, the size of representation of q(0) using a binary scheme is 


[log2(q(0))l 


log2 


which is exponential in N. 


Define the base members of families as 




M-Tlll 


llo(0,0) = {sum} 
'ho(0,0) {max} 

Ao(0,0) = 0 


'ho(0,0) = {min} 
no(0,0) {prod} 

Ai(0, {1}) {sum — sum, min — min, max — max} 


□ 


( 1 . 6 ) 


where the initial members of each family only identify the expansion semigroup ~ e.g., sum in So(0,0) 
identifies q(x) = X]ifi(xi)- Here, the exception is Ai(0,{l}), which contains three inference problems.^ 
Let r.M{S,T)) denote the union of corresponding classes within all families: 


Em{S,V) = Tjm{S,V) U U <I)m(‘ 5,I1) U 'hM(5,Il) U Am{S,V) 

^We treat M = 1 for A specially as in this case the marginalization operation can not be polynomial. This is because 
if |Ji| = 0{\og{N)), then |Jo| = H(A) which violates the conditions in the definition of the inference problem. 
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Now define the inference family members recursively, by adding a marginalization operation to all 
the problems in each inference class. If this marginalization is polynomial then the new class belongs 
to the A family and the set T) is updated accordingly. Alternatively, if this outermost marginalization 
is exponential, depending on the new marginal operation (i.e., min, max, sum) the new class is defined 
to be a member of <1>, or S. For the case that the last marginalization is summation set S is updated. 

• Adding an exponential marginalization V|Aj^| =Poly(iV),M>0 

Sm+i(5 U {M + 1}, P) {sum - ^ I ^ G Sm(5, P) \ Sm(5, V)} (1.7) 

$m+i(5, V) { min I e G ^m{S, V) \ $^(5, V)] 

^M+i{S,V) =^{max-e I e G ^m{S,V)\^m{S,V)] 

IiM+i{S,V)=% 

• Adding a polynomial marginalization V|Aj^| =Poly(A),M>l 

Aa^_i_i(5, V U {M + 1}) { © —^ I ^ G P) , © G {min, max, sum}} (1.8) 

1.2.1 Single marginalization 

The inference classes in the hierarchy with one marginalization are 


Ai(0, {1}) = {min — min, max —max, sum — sum} (1-9) 

'l'i(0,0) = {max — min, max—sum, max—prod} (1-10) 

<hi(0,0) = {min — max, min—sum, min—prod} (1-11) 

Si({l},0) = {sum — prod, sum — min, sum — max} (1-12) 


Now we review all the problems above and prove that Ai, Ti, <l>i and Si are complete w.r.t. P, NP, 
coNP and PP respectively. Starting from Ai: 

Proposition 1.2.2. sum-sum, min-min and max-max inference are in P. 

Proof. To show that these inference problems are in P, we provide polynomial-time algorithms for them; 
• sum — sum is short for 

q(0) = 

X I 


which asks for the sum over all assignments of x G A, of the sum of all the factors. It is easy to see 
that each factor value fi(;rj) VI, Xi is counted |A\i| times in the summation above. Therefore we can 
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rewrite the integral above as 


q(0) = 

I xj 

where the new form involves polynomial number of terms and therefore is easy to calculate. 

• min — min (similar for max — max) is short for 

q(0) = minminfi(xj) 

X I 

where the query seeks the minimum achievable value of any factor. We can easily obtain this by seeking 
the range of all factors and reporting the minimum value in polynomial time. □ 

Max-sum and max-prod are widely studied and it is known that their decision version are NP- 
complete [281]. By reduction from satisfiability we can show that max-min inference [258] is also 
NP-hard. 

7 

Proposition 1.2.3. The decision version of max-min inference that asks max^; minj fi(xj) > q is NP- 
complete. 

Proof. Given x it is easy to verify the decision problem, so max-min decision belongs to NP. To show 
NP-completeness, we reduce the 3-SAT to a max-min inference problem, such that 3-SAT is satisfiable 
iff the max-min value is q(0) > 1 and unsatisfiable otherwise. 

Simply define one factor per clause of 3-SAT, such that fi(xj) = 1 if Xj satisfies the clause and 
any number less than one otherwise. With this construction, the max-min value max^, minigj-fi(xj) is 
one iff the original SAT problem was satisfiable, otherwise it is less than one. This reduces 3-SAT to 
Max-Min-decision. □ 

This means all the problems in \ki(0,0) are in NP (and in fact are complete w.r.t. this complexity 

class). In contrast, problems in <I>i(0,0) are in coNP, which is the class of decision problems in which 

the “NO instances” result has a polynomial time verifiable witness or proof. Note that by changing the 

? ? 

decision problem from q(0) > g to q(0) < q, the complexity classes of problems in d> and T family are 
reversed (i.e., problems in <I>i(0,0) become NP-complete and the problems in Ti(0,0) become coNP- 
complete). 

Among the members of Si({l},0), sum-product is known to be PP-complete [194, 267]. It is easy 
to show the same result for sum-min (sum-max) inference. 

? 

Proposition 1.2.4. The sum-min decision problem mini fi(Ti) ^ Q is FF-complete for y = {0,1}. 

PP is the class of problems that are polynomially solvable using a non-deterministic Turing machine, 
where the acceptance condition is that the majority of computation paths accept. 
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Proof. To see that mini fi(xj) > is in PP, enumerate all x G T non-deterministically and for each 
assignment calculate minifi(xi) in polynomial time (where each path accepts iff minifi(xi) = 1) and 
accept iff at least q of the paths accept. 

Given a matrix A G {0, the problem of calculating its permanent 

N 

perm(A) = ^ 

where iStv is the set of permutations of 1,..., A is ^P-complete and the corresponding decision problem 
is PP-complete [297]. To show completeness w.r.t. PP it is enough to reduce the problem of computing 
the matrix permanent to sum-min inference in a graphical model. The problem of computing the 
permanent has been reduced to sum-product inference in graphical models [139]. However, when fi(xi) G 
{0,1} VI, sum-product is isomorphic to sum-min. This is because yi x y 2 = min(?/i, ?/ 2 )Vyi G {0,1}. 
Therefore, the problem of computing the permanent for such matrices reduces to sum-min inference in 
the factor-graph of [139]. □ 

1.2.2 Complexity of general inference classes 

Let 15(.) denote the complexity class of an inference class in the hierarchy. In obtaining the complexity 
class of problems with M > 1 , we use the following fact, which is also used in the polynomial hierarchy: 
]pNP _ ]pcoNP j 24 ]. In fact £qj. Qp^cle A. This means that by adding a polyno¬ 
mial marginalization to the problems in <1^(5, 2 ?) and we get the same complexity class 

15(Am-i-i(5, P U {M -|- 1})). The following gives a recursive definition of complexity class for problems 
in the inference hierarchy.'^ Note that the definition of the complexity for each class is very similar to 
the recursive definition of members of each class in equations (1.7) and (1.8) 

Theorem 1.2.5. The complexity of inference classes in the hierarchy is given by the recursion 


U{^M+i{S,V)) = (1.13) 

G(Tm+i(5,T')) (1.14) 

U{T,m+i{S U {M + 1},V)) = ipip«(Sm(5,d)\Sm(5,d)) 

G(Am+i(5,T>U{M + 1})) =]p^i^Mis,v)) (£ £g) 


where the base members are defined in equation (1.6) and belong to P. 

Proof. Recall that our definition of factor graph ensures that q(x) can be evaluated in polynomial time 
and therefore the base members are in P (for complexity of base members of A see proposition 1.2.2). 

® We do not prove the completeness w.r.t. complexity classes beyond the first level of the hierarchy and only assert the 
membership. 
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We use these classes as the base of our induction and assuming the complexity classes above are correct 
for M we show that are correct for M + 1. We consider all the above statements one by one: 

• Complexity for members of 

Adding an exponential-sized min-marginalization to an inference problem with known complexity A, 
requires a Turing machine to non-deterministically enumerate G possibilities, then call the A 
oracle with the “reduced factor-graph” ~ in which Xj^ is clamped to Zj^ - and reject iff any of the 
calls to oracle rejects. This means ?J(<1 >m-i-i( 5,P)) = coNP^. 

Here, equation (1.13) is also making another assumption expressed in the following claim. 

Claim 1.2.6. All inference classes in 'Em{S,V) \ ^m{S,V) have the same complexity A. 

• M = 0: the fact that q(x) can be evaluated in polynomial time means that A = P. 

• M > 0: \ only contains one inference class - that is exactly only one of the 

following cases is correct: 

- MgS ^ Hm(5,P) \ = Sm(5,P) 

- MgV ^ '^m{S,V) \ ^m{S,V) = Am{S,V) 

-MiScV ^ Hm(5,P)\$m(5,P) = Tm(5,P). 

(in constructing the hierarchy we assume two consecutive marginalizations are distinct and 
the current marginalization is a minimization.) 

But if 'Rm{S,V) \ ^m{S,T>) contains a single class, the inductive hypothesis ensures that all 
problems in 'Em{S,V) \ have the same complexity class A. 

This completes the proof of our claim. 

• Complexity for members of 

Adding an exponential-sized max-marginalization to an inference problem with known complexity A, 
requires a Turing machine to non-deterministically enumerate Zj^ G possibilities, then call the 
A oracle with the reduced factor-graph and accept iff any of the calls to oracle accepts. This means 
?J(Tm-i-i(‘ 5, P)) = Here, an argument similar to that of Claim 1.2.6 ensures that ^^(5,2?) \ 

in equation (1.14) contains a single inference class. 

• Complexity for members of Sm-i-i(5 U {M -|- 

Adding an exponential-sized sum-marginalization to an inference problem with known complexity A, 
requires a Turing machine to non-deterministically enumerate G possibilities, then call the A 
oracle with the reduced factor-graph and accept iff majority of the calls to oracle accepts. This means 
U{^M+i{S,V))=W^. 

• M = 0: the fact that q(x) can be evaluated in polynomial time means that A = P. 


• M > 0: 
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- M E D => Em{S,V) \ Tjm{S,V) = Am{S,V). 

— M ^ V U S Em{S,V) \ Tjm{S,V) = U ^m{S,V)\ despite the fact that 

A = 13{^m{S-,T>)) is different from A' = since PP is closed under complement, 

which means PP^ = PP'^ and the recursive definition of complexity equation (1.15) remains 
correct. 

• Complexity for members of Am+i(‘5, V U {M + 1}): 

Adding a polynomial-sized marginalization to an inference problem with known complexity A, requires 
a Turing machine to deterministically enumerate E possibilities in polynomial time, and each 
time call the A oracle with the reduced factor-graph and accept after some polynomial-time calculation. 
This means P)) = P^. Here, there are three possibilities: 

• M = 0: here again A = P. 

• M E 5 => r.M{S,'D) = 

• M £T) ^ Em{S,V) = Am{S,V). 

• M ^ V LI S Em{S,V) = U ^m{‘S,'D), in which case since PP^'^ = pjpcoHP ^ 

recursive definition of complexity in equation (1.16) remains correct. 

□ 

Example 1.2.2. Consider the marginal-MAP inference of equation (1.4). The decision version of this 

7 

problem, q(0) > g, is a member of T^dl}, 0) which also includes max —sum —min and max —sum—max. 
The complexity of this class according to equation (1.14) is ^(T^dl}, 0)) = NP*^*^. However, marginal- 
MAP is also known to be “complete” w.r.t. NP'^^ [241]. Now suppose that the max-marginalization over 
Xj 2 is polynomial {e.g., | J 2 I is constant). Then marginal-MAP belongs to A2({1}, {2}) with complexity 
pPP^ This is because a Turing machine can enumerate all Zj^ E in polynomial time and call its PP 
oracle to see if 

7 

q(Ajo I ^Ja) > Q 

where q (xUj^) = X] 11 T ’ ^inJ 2 ) 

5J2 I 

and accept if any of its calls to oracle accepts, and rejects otherwise. Here, b(xjy^ > ^inJ 2 ) reduced 

factor, in which all the variables in Xj^ are fixed to ^jjni- 

The example above also hints at the rationale behind the recursive dehnition of complexity class for 
each inference class in the hierarchy. Consider the inference family 4>: 

Here, Toda’s theorem [296] has an interesting implication w.r.t. the hierarchy. This theorem states 
that PP is as hard as the polynomial hierarchy, which means min — max — min — ... — max inference for 
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an arbitrary, but constant, number of min and max operations appears below the sum-product inference 
in the inference hierarchy. 

1.2.3 Complexity of the hierarchy 

By restricting the domain y* to {0,1}, min and max become isomorphic to logical AND (A) and OR (V) 
respectively, where 1 = true, 0 = false. By considering the restriction of the inference hierarchy to 
these two operations we can express quantified satisfiability (QSAT) as inference in a graphical model, 
where A = V and V = 3. Let each factor fi(Ti) be a disjunction -e.g., = Xj V -iXj V -■Xfc. Then 

we have 

\/xj /\fi(xj) = min max ... maxminminfi(xj) 

M M — 1 2 !'• X X ^ X j X j T 

j -Jm —JM-1 -^2 -Jl 

By adding the summation operation, we can express the stochastic satisfiability [194] and by generalizing 
the constraints from disjunctions we can represent any quantified constraint problem (QCP) [36]. QSAT, 
stochastic SAT and QCPs are all PSPACE-complete, where PSPACE is the class of problems that can 
be solved by a (non-deterministic) Turing machine in polynomial space. Therefore if we can show that 
inference in the inference hierarchy is in PSPACE, it follows that inference hierarchy is in PSPACE- 
complete as well. 

Theorem 1.2.7. The inference hierarchy is ¥E>FACE-complete. 


Proof, (theorem 1.2.7) To prove that a problem is PSP ACE-complete, we have to show that 1) it is 
in PSPACE and 2) a PSPACE-complete problem reduces to it. We already saw that QSAT, which is 
PSPACE-complete, reduces to the inference hierarchy. But it is not difficult to show that inference 
hierarchy is contained in PSPACE. Let 


M 


M-1 


qfeo) = 0, 0, 


.0^ 0fl(Ai) 


be any inference problem in the hierarchy. We can simply iterate over all values of z G A in nested 
loops or using a recursion. Let j{i) : {l,...,Ai} —>■ {1,...,M} be the index of the marginalization 
that involves x* - that is f G Moreover let ii,... ,iN be an ordering of variable indices such that 
j(ffc) < Algorithm 1 uses this notation to demonstrate this procedure using nested loops. Note 

that here we loop over individual domains rather than Aj^ and track only temporary tuples , so 
that the space complexity remains polynomial in N. 


□ 



22 


CHAPTER 1. REPRESENTATION, COMPLEXITY AND REDUCIBILITY 


M M-1 1 

input :0 0 •••0. 01 fife) 

-Jm —^M-l -^1 

output: q(xjJ 

for each G do // loop over the query domain 
for each Zi^ G Xi^ do // loop over Xi^ 


for each Zi-^ G Xi^ do // loop over Xi^ 

I qi(tii) := 0ifi(ti); 

end 

i(*i) 

q*2(^i2) := 0-qifei) 


qY(tijv) := 0 xi„ qY-ifejv-i) 

end 

3{iN) 

qfeo) := 0 x,^qYfe^) 

end 

Algorithm 1: inference in PSPACE 


1.3 Polynomial-time inference 

Our definition of inference was based on an expansion operation ® and one or more marginalization 
1 M 

operations If we assume only a single marginalization operation, polynomial time inference is 

still not generally possible. However, if we further assume that the expansion operation is distributive 
over marginalization and the factor-graph has no loops, exact polynomial time inference is possible. 

Definition 1.3.1. A commutative semiring = (y*,©,©) is the combination of two commutative 
semigroups = (y*) ©) and = (y*, ©) with two additional properties 

0 0 © 0 0 

• identity elements 1 and 1 such that 1 © a = a and 1 © a = a. Moreover 1 is an annihilator for 

0 0 

= (®,y*); a© 1 = 1 VaGy©^ 

• distributive property: 

a © (6 © c) = (a © 6) © (a © 6) Va, b,c G y* 

© 

That is when dealing with reals, this is 1 = 0; this means a x 0 = 0. 


4 
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The mechanism of efficient inference using distributive law can be seen in a simple example: instead 
of calculating min(a + 6 , a + c), using the fact that summation distributes over minimization, we may 
instead obtain the same result using a + min( 6 , c), which requires fewer operations. 

Example 1.3.1. The following are some examples of commutative semirings: 

- Sum-product x). 

- Max-product (M-° U {—oo},max, x) and ({0, l},max, x). 

- Min-max (5, min, max) on any ordered set S. 

- Min-sum (M U {oo}, min,-|-) and ({0,1}, min,-|-). 

- Or-and ({true, false}, V, A). 

- Union-intersection (2‘^,U,n) for any power-set 2“^. 

- The semiring of natural numbers with greatest common divisor and least common multiple {J\f, Icm, gcd). 

- Symmetric difference-intersection semiring for any power-set (2*^, V,n). 

Many of the semirings above are isomorphic -e.g., y' = — log(y) defines an isomorphism between 
min-sum and max-product. It is also easy to show that the or-and semiring is isomorphic to min- 
sum/max-product semiring on y* = { 0 , 1 }. 

The inference problems in the example above have different properties indirectly inherited from their 

commutative semirings: for example, the operation min (also max) is a choice function, which means 

minag_ 4 a G A. The implication is that if sum of the semiring is min (or max), we can replace it with 

arg max and (if required) recover q(0) using q(0) = 0Tfi(;r*) in polynomial time. 

-Jm 

As another example, since both operations have inverses, sum-product is a field [247]. The avail¬ 
ability of inverse for ® operation ~ f.e., when is an Abelian group ~ has an important implication 
for inference: the expanded form of equation (1.1) can be normalized, and we may inquire about nor¬ 
malized marginals 


p(tj) = 0 p(x) 

3I\j 

where p(x) 


q(0) 


((0fi(s)) 


Hpf ® w ® 

p(x) = 1 if q( 0 ) = 1 


(1.17) 

if q( 0 ) ^ 1 (1.18) 

(1.19) 


where p(x) is the normalized joint form. We deal with the case where the integral evaluates to the 
annihilator as a special case because division by annihilator may not be well-defined. This also means, 
when working with normalized expanded form and normalized marginals, we always have p(tj) = 1 

Example 1.3.2. Since x) and = (1^,+) are both Abelian groups, min-sum and sum- 

sum 

product inference have normalized marginals. For min-sum inference this means mina;^ p(xj) = 1 = 0. 
However, for min-max inference, since (5, max) is not Abelian, normalized marginals are not defined. 
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We can apply the identity and annihilator of a commutative semiring to define constraints. 

r® 

Definition 1.3.2. A constraint is a factor fj : Ai —)• {1,1} whose range is limited to identity and 
annihilator of the expansion monoid. 

© ® 

Here, fi(x) = 1 iff x is forbidden and fi(x) = 1 iff it is permissible. A constraint satisfaction 
problem (CSP) is any inference problem on a semiring in which all factors are constraints. Note that 
this allows definition of the “same” CSP on any commutative semiring. The idea of using different 
semirings to define CSPs has been studied in the past [33], however its implication about inference on 
commutative semirings has been ignored. 

Theorem 1.3.1. Inference in any commutative semiring is N¥-hard under randomized polynomial-time 
reduction. 

© ® 

Proof. To prove that inference in any semiring = (T*, 1, 1) is NP-hard under randomized polynomial 
reduction, we deterministically reduce unique satisfiability (USAT) to an inference problems on any 
semiring. US AT is a so-called “promise problem”, that asks whether a satisfiability problem that is 
promised to have either zero or one satisfying assignment is satisfiable. Valiant and Vazirani [298] prove 
that a polynomial time randomized algorithm (MP) for USAT implies a MP=NP. 

For this reduction consider a set of binary variables t G {0,1}^, one per each variable in the given 

© 

instance of USAT. For each clause, define a constraint factor fj such that fi(xj) = 1 if Xj satisfies 

© 

that clause and fi(xi) = 1 otherwise. This means, x is a satisfying assignment for USAT iff q(x) = 
® © © © 
©ifi(l'i) = 1- If the instance is unsatisfiable, the integral q(0) = 0,^, 1 = 1 (by definition of 1). If the 

© 

instance is satisfiable there is only a single instance x* for which q(x*) = 1, and therefore the integral 

© 

evaluates to 1. Therefore we can decide the satisfiability of USAT by performing inference on any 
semiring, by only relying on the properties of identities. The satisfying assignment can be recovered 
using a decimation procedure, assuming access to an oracle for inference on the semiring. 

□ 

Example 1.3.3. Inference on xor-and semiring ({true, false}, xor. A), where each factor has a dis¬ 
junction form, is called parity-SAT, which asks whether the number of SAT solutions is even or odd. A 
corollary to theorem 1.3.1 is that parity-SAT is NP-hard under randomized reduction, which is indeed 
the case [298]. 


® Recall that a monoid is a semigroup with an identity. The existence of identity here is a property of the semiring. 
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We find it useful to use the same notation for the identity function 1 (condition): 

(+, x) (min,+) (min,max) 

1 0 —oo 

0 +00 +00 

where the intended semiring for 1(.) function will be clear from the context. 


l(cond.) < 


cond. = TRUE 
cond. = FALSE 


( 1 . 20 ) 


1.4 Reductions 

Several of the inference problems over commutative semirings are reducible to each other. Section 1.4.1 
reviews the well-known reduction of marginalization to integration for general commutative semirings. 
We use this reduction to obtain approximate message dependencies in performing loop corrections in 
section 2.4. 

In equation (1.22), we introduce a procedure to reduce integration to that of finding normalized 
marginals. The same procedure, called decimation, reduces sampling to marginalization. The problem 
of sampling from a distribution is known to be almost as difficult as sum-product integration [151]. As 
we will see in chapter 3, constraint satisfaction can be reduced to sampling and therefore marginalization. 
In section 2.6.3 we introduce a perturbed message passing scheme to perform approximate sampling and 
use it to solve CSPs. Some recent work perform approximate sampling by finding the MAP solution 
in the perturbed factor-graph, in which a particular type of noise is added to the factors [125, 239]. 
Approximate sum-product integration has also been recently reduced to MAP inference [96, 97]. In 
section 2.3, we see that min-max and min-sum inference can be obtained as limiting cases of min-sum 
and sum-product inference respectively. 

Section 1.4.2 reduces the min-max inference to min-sum also to a sequence of CSPs (and therefore 
sum-product inference) over factor-graphs. This reduction gives us a powerful procedure to solve min- 
max problems, which we use in part II to solve bottleneck combinatorial problems. 

In contrast to this type of reduction between various modes of inference, many have studied reduc¬ 
tions of different types of factor-graphs [90]. Some examples of these special forms are factor-graphs with: 
binary variables, pairwise interactions, constant degree nodes, and planar form. For example Sanghavi 
et al. [274] show that min-sum integration is reducible to maximum independent-set problem. However 
since a pairwise binary factor-graph can represent a maximum independent-set problem (see section 3.7), 
this means that min-sum integration in any factor-graph can be reduced to the same problem on a pair¬ 
wise binary model. 

These reductions are in part motivated by the fact that under some further restrictions the restricted 
factor-graph allows more efficient inference. For example, (I) it is possible to calculate the sum-product 
integral of the planar spin-glass Ising model (see example 1.1.4) in polynomial time, in the absence of 
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local fields [99]; (II) the complexity of the loop correction method that we study in section 2.4.2 grows 
exponentially with the degree of each node and therefore it may be benehcial to consider reduced factor- 
graph where \di\ = 3; and (III) if the factors in a factor-graphs with pairwise factors satisfy certain 
metric property, polynomial algorithms can obtain the exact min-sum integral using graph-cuts [41]. 

1.4.1 Marginalization and integration 

This section shows how for arbitrary commutative semirings there is a reduction from marginalization 
to integration and vice versa. 

Marginalization rednces to integration 

For any fixed assignment to a subset of variables = Zp^ (a.k.a. evidence), we can reduce all the 
factors fi(xj) that have non-empty intersection with A (he., I n A 7 ^ 0) accordingly: 

h\A{xi\A I ^a) 0 fife) <8) l(xinA = ^iha) VI s.t. A n I / 0 ( 1 . 21 ) 

^inA 

where the identity function 1(.) is defined by equation (1.20). The new factor graph produced by 
clamping all factors in this manner, has effectively accounted for the evidence. Marginalization or 
integration, can be performed on this reduced factor-graph. We use similar notation for the integral and 
marginal in the new factor graph - i.e., q(0 | Xp) and q(xB | xjf). Recall that the problem of integration 
is that of calculating q(0). We can obtain the marginals q(2A) integration on reduced factor-graphs 
for all za ^ '^A reductions. 

Claim 1.4.1. 


q(^A) = q(01 ^a) (1-22) 

Proof. 

9(^a) = 00fife\A>^Ani) 

X\A I 

= © (ifeA = 2 a) ® ® fl{$I\A. iArSj 

= ©©( fA(xA) ® l(xinA = ^iha) ) 

= 00fi\A(^i\A feA) = q(0 I^a) 

X I 

□ 
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where we can then normalize q(xi) values to get p(xi) (as defined in equation (1.17)). 

Integration reduces to marginalization 

Assume we have access to an oracle that can produce the normalized marginals of equation (1.17). 
We show how to calculate q(0) by making N calls to the oracle. Note that if the marginals are not 
normalized, the integral is trivially given by q(0) = 0^^ q(Tj) 

Start with t = 1, B(t = 0) = 0 and given the normalized marginal over a variable p(xj(j)), fix the 
Xjp) to an arbitrary value G Then reduce all factors according to equation (1.21). Repeat this 

process of marginalization and clamping N times until all the variables are fixed. At each point, B(t) 
denotes the subset of variables fixed up to step t (including i{t)) and p(xjp) | :2B(f-i)) = 
refers to the new marginal. Note that we require i{t) ^ B(t — 1) - that is at each step we fix a different 
variable. 

© © 

We call an assignment to invalid, if p(^i(t) | :ZB{t)) = 1- This is because 1 is the annihilator 

of the semiring and we want to avoid division by the annihilator. Using equations (1.17) to (1.19), it is 

© 

easy to show that if q(0) / 1, a valid assignment always exists (this is because 03 ,.P(Ti(t) | ^B{t-i)) = 

1). Therefore if we are unable to find a valid assignment, it means q(0) = 1. 

Let z = ^b(n+i) denote the final joint assignment produced using the procedure above. 


Proposition 1.4.2. The integral in the original factor-graph is given by 


-1 


q(0) = 


Ifi(a) ® 




(1.23) 


A<t<N 


where the inverse is defined according to ^-operation. 

Proof. First, we derive the an equation for “conditional normalized marginals” for semirings where 
defines an inverse. 

Claim 1.4.3. For any semiring with normalized joint form we have 

p(t) = P(a;i) ® p(x\j I Xi) 




where p(x\i | Xi) = 

To arrive at this equality first note that since x = Xy,Xi, q(x\j | Xi) = q(a;). Then multiply both 
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sides by q(xi) = q(0 | Xi) (see Claim 1.4.1) to get 


q(x) (g) q(0 I a:*) = q(xi) (g) q(®\j I Xi) 
q(x) _ q(xi) q(x\i I Xj) 

q(0) q(0) q(0 | x*) 


p(x) = p(Xi) (g) p(x\j I Xi) 


where we divided both sides by q(0) and moved a term from left to right in the second step. 
Now we can apply this repeatedly to get a chain rule for the semiring: 


p(x) = p(XiJ (g) p(Xi 2 I XiJ (g) p(Xi 3 I X{i,^i^}) (g) . . . (g) p(Xi^ I X{i^^ 
which is equivalent to 

p(x) = p(Xi(i))®p(Xi( 2 ) I XB(i))®...®p(Xi(Ar) I XB(Ar_i)) = P(^ip) UB(i-l)) 

l<t<N 

Simply substituting this into definition of p(x) (equation (1.18)) and re-arranging we get 

1 


P(®i(t) I ®B(t-l)) — 


l<t<N 


q( 


fife) 


leJ' 




,l<t<N 


□ 


The procedure of incremental clamping is known as decimation, and its variations are typically 
used for two objectives: (I) recovering the MAP assignment from (max) marginals (assuming a max- 
product semiring). Here instead of an arbitrary Zj G Aj, one picks Zj = arg^,^ max p(xj). (II) producing 
an unbiased sample from a distribution p(.) (he., assuming sum-product semiring). For this we sample 
from p(xi): Zj ~ p(xj). 


1.4.2 Min-max reductions 

The min-max objective appears in various helds, particularly in building robust models under uncertain 
and adversarial settings. In the context of probabilistic graphical models, several min-max objectives 
different from inference in min-max semiring have been previously studied [140, 162] (also see sec¬ 
tion 2.1.2). In combinatorial optimization, min-max may refer to the relation between maximization 
and minimization in dual combinatorial objectives and their corresponding linear programs [276], or it 
may refer to min-max settings due to uncertainty in the problem specification [5, 16]. 
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In part II we will see that several problems that are studied under the class of bottleneck prob¬ 
lems can be formulated using the min-max semiring. Instances of these problems include bottleneck 
traveling salesman problem [242], K-clustering [119], K-center problem [87, 164] and bottleneck assign¬ 
ment problem [121]. 

Edmonds and Fulkerson [92] introduce a bottleneck framework with a duality theorem that relates 
the min-max objective in one problem instance to a max-min objective in a dual problem. An intuitive 
example is the duality between the min-max cut separating nodes a and b - the cut with the minimum 
of the maximum weight - and min-max path between a and b, which is the path with the minimum 
of the maximum weight [104], Hochbaum and Shmoys [136] leverages the triangle inequality in metric 
spaces to hnd constant factor approximations to several NP-hard min-max problems under a unihed 
framework. 

The common theme in a majority of heuristics for min-max or bottleneck problems is the relation 
of the min-max objective to a CSP [136, 237]. We establish a similar relation within the context of 
factor-graphs, by reducing the min-max inference problem on the original factor-graph to inference over 
a CSP factor-graph (see section 1.3) on the reduced factor-graph in equation (1.25). In particular, since 
we use sum-product inference to solve the resulting CSP, we call this reduction, sum-product reduction 
of min-max inference. 

Min-max reduces to min-sum 

Here, we show that min-max inference reduces to min-sum, although in contrast to the sum-product 
reduction of the next subsection, this is not a polynomial time reduction. First, we make a simple 
observation about min-max inference. Let y = Uig denotes the union over the range of all factors. 
The min-max value belongs to this set maxigj-fi(xj) G y. In fact for any assignment x, maxigj-fi(xj) G 

T. 

Now we show how to manipulate the factors in the original factor-graph to produce new factors over 
the same domain such that the min-max inference on the former corresponds to the min-sum inference 
on the later. 

Lemma 1.4.4. Any two sets of factors, {fj} and {gi}, over the identical domains {Aj} have identical 
min-max solutions 


arg^, min max fj (xj) = arg^, min max gi (xj) 

if VI, J G J~, Xj G Aj, Xj G Aj 


fl(s) < ^ gife) < gjfe) 
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Proof. Assume they have different min-max assignments*’ -i.e., x* = arg^, minmaxi fi(xi), x** = arg^, minmaxi 
and X* 7 ^ x'*. Let y* and y'* denote the corresponding min-max values. 

Claim 1.4.5. 


y* > maxfi(x'/) 44 y'* < maxgi(xi) 

y* < maxfi(xj*) 44 y'* > maxf{(xjf) 

This simply follows from the condition of the Lemma. But in each case above, one of the assignments 
y* or y'* is not an optimal min-max assignment as there is an alternative assignment that has a lower 
maximum over all factors. □ 

This lemma simply states that what matters in the min-max solution is the relative ordering in the 
factor-values. 

Let y[l\ < ... < y[|T|] be an ordering of elements in y, and let r{fi{xi)) denote the rank in 
|T|} of yi = fi(xj) in this ordering. Define the min-sum reduction of {fijigj- as 

gl(xi) = VI G 


Theorem 1.4.6. 


argj, min gl(xj) = arg^, minmaxfi(xj) 

I ^ 

where {gi}i is the min-sum reduction o/{fi}i. 

Proof. First note that since g{z) = 2^ is a monotonically increasing function, the rank of elements in 
the range of {gi}i is the same as their rank in the range of {fi}i. Using Lemma 1.4.4, this means 

arg^, min max gi (xj) = arg^, min max fi (xj). (1-24) 

Since 2^ > Yli=o 2^ by definition of {gi} we have 

maxgi(xi) > E gi(xj) where T = argj maxgi(xj) 

®Por simplicity, we are assuming each instance has a single min-max assignment. In case of multiple assignments there is 
a one-to-one correspondence between them. Here the proof instead starts with the assumption that there is an assignment 
P for the first factor-graph that is different from all min-max assignments in the second factor-graph. 
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It follows that for G X, 

maxgi(xj) < maxgi(xj) gl(®l) 

1 I 


Therefore 


argj, minmaxgi(xj) = arg^, min E gi(s)- 

I 

This equality, combined with equation (1.24), prove the statement of the theorem. □ 

An alternative approach is to use an inverse temperature parameter /3 and re-state the min-max 
objective as the min-sum objective at the low temperature limit 

lim argj, min f^(xj) = arg,^ minmaxfi(xj) (1-25) 

/3^+oo — ^^ — 1 


Min-max reduces to sum-product 

Recall that y = UieJ"^! denote the union over the range of all factors. For any y G y, we reduce the 
original min-max problem to a CSP using the following reduction. 

Definition 1.4.1. For any y G y, Pj^-reduction of the min-max problem: 


X* = arg^, min m^ fllHi) (1-26) 

is given by 

2 ) = ( 1 - 27 ) 

where qy(0) is the normalizing constant.^ 

This distribution defines a CSP over X, where Py(x) > 0 iff x is a satisfying assignment. Moreover, 
qy(0) gives the number of satisfying assignments. The following theorem is the basis of our reduction. 

Theorem 1.4.7. Let x* denote the min-max solution and y* be its corresponding value -i.e., y* = 
maxi fi(xj). Then Py{x) is satisfiable for all y > y* (in particular Py{x*) > Oj and unsatisfiable for all 

- 3k 

y <y ■ 

^ To always have a well-defined probability, we define ^ 0. 
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Proof. (A) py for y > y* is satisfiable: It is enough to show that for any y > y*, PyisA) > 0. But since 

and fi(Tj) <y*<y, all the indicator functions on the rhs evaluate to 1, showing that Py{x*) > 0. 

(B) Py for y < y* is not satisfiable: Towards a contradiction assume that for some y < y*, Py is 
satisfiable. Let x denote a satisfying assignment -i.e., Py{x) > 0. Using the definition of p^-reduction, 
this implies that l(fi(Xj) < y) > 0 for all 1 € J-. However this means that maxifi(Xj) < y < y*, which 
means y* is not the min-max value. □ 

This theorem enables us to find a min-max assignment by solving a sequence of CSPs. Let 7/[l] < 
• • • < 2/[|3^|] be an ordering of y G y. Starting from y = 7/[|'N/2]], if py is satisfiable then y* < y. On 
the other hand, if Py is not satisfiable, y* > y. Using binary search, we need to solve log(|T|) CSPs 
to find the min-max solution. Moreover at any time-step during the search, we have both upper and 
lower bounds on the optimal solution. That is y < y* < y, where Pj; is the latest unsatisfiable and Py 
is the latest satisfiable reduction. 

However, finding an assignment x* such that Py{x*) > 0 or otherwise showing that no such assign¬ 
ment exists, is in general, NP-hard. Instead, we can use an incomplete solver [160], which may find a 
solution if the CSP is satisfiable, but its failure to find a solution does not guarantee unsatisfiability. 
By using an incomplete solver, we lose the lower bound y on the optimal min-max solution.® However 
the following theorem states that, as we increase y from the min-max value y*, the number of satisfying 
assignments to p^-reduction increases, making it potentially easier to solve. 

Proposition 1.4.8. 


yi<y2 => q2;i(0) < q?;2(0) V?/1,2/2GT 

where qj;(0) (i.e., partition function) is the number of solutions of Py-reduction. 
Proof. Recall the definition qy(0) = Hi l(fl(iLi) ^ y)- For yi < 2/2 we have: 

fl(Ti) <2/1 ^ fl(Ti) < 222 

l(fl(xi) < 2/i) < l(fl(xi) < 2/2) 

Yl n ^ yi) ^ X] n ^ y 2 ) 

X I X I 

q?;i( 0 ) < qj/2(0) 




□ 


To maintain the lower bound one should be able to correctly assert unsatisfiability. 
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This means that the sub-optimality of our solution is related to our ability to solve CSP-reductions 
- that is, as the gap y — y* increases, the p^-reduction potentially becomes easier to solve. 



Chapter 2 

Approximate inference 


2.1 Belief Propagation 

A naive approach to inference over commutative semirings 

qfej) = 0®fife) (2.1) 

*\j I 

or its normalized version (equation (1.17)), is to construct a complete A-dimensional array of q(x) using 
the tensor product q(x) = (^jfi(xi) and then perform ©-marginalization. However, the number of 
elements in q(x) is 1^1, which is exponential in N, the number of variables. 

If the factor-graph is loop free, we can use distributive law to make inference tractable. Assuming 
q(.liK) (o^ q(®A:)) is the marginal of interest, form a tree with K (or k) as its root. Then starting from 
the leaves, using the distributive law, we can move the © inside the © and define “messages” from leaves 
towards the root as follows: 


qi^i(xj) = 0 qj^i(xj) 

( 2 . 2 ) 

Je9i\I 


qi^i(xO = 0 fi(a) (0 qj^l(a^i) 

(2.3) 


jedl\i 


where equation ( 2 . 2 ) defines the message from a variable to a factor, closer to the root and similarly 
equation (2.3) defines the message from factor I to a variable i closer to the root. Here, the distributive 
law allows moving the © over the domain from outside to inside of equation (2.3) - the same way 
© moves its place in (o © 6 ) © (a © c) to give a © (6 © c), where a is analogous to a message. 

By starting from the leaves, and calculating the messages towards the root, we obtain the marginal 
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Figure 2.1: The figure shows a loop-free factor-graph and the direction of messages sent between variable and 
factor nodes in order to calculate the marginal over the grey region. 


over the root node as the product of incoming messages 

q(a^fc) = qi^fc(xfc) 

l&dk 


(2.4) 


In fact, we can assume any subset of variables (and factors within those variables) to be the root. 
Then, the set of all incoming messages to A, produces the marginal 


qfeA) = I 

.ICA / \ieA,Je9i,J2A 

Example 2.1.1. Consider the joint form represented by the factor-graph of figure 2.1 


(2.5) 


q(^) = fA(TA) 

Ae{I,J,K,L,0,T,U,V,W,X,Y,Z} 

and the problem of calculating the marginal over (^-e., the shaded region). 

q(®{i,i,fc}) =0 0 fA(xA) 

Ae{I,J,K,L,0,T,U,V,W,X,Y,Z} 

We can move the © inside the © to obtain 


q(®{ij,fc}) = fl(Ti) ® qL-^i(xi) © qK^i(xi) © qv^j(xj) © qw^jixj) © qK^fc(xfc) 
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where each term qA->-i factors the summation on the corresponding sub-tree. For example 

= 0fL(®L) 

Here the message qw^j is itself a computational challenge 

qw^i = 0 0 fA(®A) 

^\j Ae{W,U,Y,X,0,T,Z} 

However we can also decompose this message over sub-trees 

qw-^i = 0 fA(TA) ® qe^w(a:e) ® qr^w(a:r) 

^\j 

where again using distributive law qe-^-w and qr-s-w further simplify based on the incoming messages to 
the variable nodes Xr and x^- 

This procedure is known as Belief Propagation (BP), which is sometimes prefixed with the corre¬ 
sponding semiring e.g., sum-product BP. Even though BP is only guaranteed to produce correct answers 
when the factor-graph is a tree (and few other cases [8, 22, 310, 313]), it performs surprisingly well when 
applied as a fixed point iteration to graphs with loops [106, 225]. In the case of loopy graphs the message 
updates are repeatedly applied in the hope of convergence. This is in contrast with BP on trees, where 
the messages - from leaves to the root - are calculated only once. The message update can be applied 
to update the messages either synchronously or asynchronously and the update schedule can play an 
important role in convergence (e.^., [94, 173]). Here, for numerical stability, when the (8> operator has 
an inverse, the messages are normalized. We use oc to indicate this normalization according to the mode 
of inference 


Pl-^i(T*) 

OC 

0fl(Tl) 0 

¥.\i j&dl\i 

oc Pl^i(Pgj^.^j)(Xi) 

(2.6) 

Pi^l(Tj) 

OC 

0 Pj^j(Xj) 

Jeai\i 

0^ 

(2.7) 

P(S) 

oc 

fi(^i) 0 P^I(a;i) 
ieai 


(2.8) 

P(Tj) 

oc 

0Pl^j(Xi) 

l&di 


(2.9) 


Here, for general graphs, p(Tj) and p(ti) are approximations to p(xj) and p(xi) of equation (1.17). The 
functionals Pj_>i(pgj\^j_,.j)(.) and Pi_>.i(pgj^._^j)(.) cast the BP message updates as an operator on a 
subset of incoming messages - i.e., = {pj^i | J G \ I}. We use these functional notation in 
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presenting the algebraic form of survey propagation in section 2.5. 

Another heuristic that is often employed with sum-product and min-sum BP is the Damping of 
messages. This often improves the convergence when BP is applied to loopy graphs. Here a damping 
parameter A G (0,1] is used to partially update the new message based on the old one - e.g., for 
sum-produt BP we have 


Pl^i(Ti) 

oc Api^i(xi) -L (1 - A) ^ 


(2.10) 





9i^l{Xi) 

oc Api^i(xi) -L (1 - A) ^ 

Pj^i{Xi)j 

(2.11) 



^ JeaAi ^ 

(2.12) 


where as an alternative one may use the more expensive form of geometric damping (where A appears 
in the power) or apply damping to either variable-to-factor or factor-to-variable messages but not both. 
Currently - similar to several other ideas that we explore in this thesis - damping is a “heuristic”, which 
has proved its utility in applications but lacks theoretical justification. 

2.1.1 Computational Complexity 

The time complexity of a single variable-to-factor message update (equation (2.2)) is 0{\di\ \Xi\). To 

save on computation, when a variable has a large number of neighbouring factors, and if none of the 

e 

message values is equal to the annihilator 1 {e.g., zero for the sum-product), and the inverse of ( 8 > is 
defined, we can derive the marginals once, and produce variable-to-factor messages as 

Pi^i(xi) = p(xj) 0 (pi^i(xi))“^ VI G di (2.13) 

This reduces the cost of calculating all variable-to-factor messages leaving a variable from C(| A)! \di\'^) 
to CdTjl |9i|). We call this type of BP update, variable-synchronized (v-sync) update. Note that 
since max is not Abelian on any non-trivial ordered set, min-max BP does not allow this type of variable- 
synchronous update. This further motivates using the sum-product reduction of min-max inference. The 
time complexity of a single factor-to-variable message update (equation (2.3)) is 0(|Aj|). However as 
we see in section 2.2, sparse factors allow much faster updates. Moreover in some cases, we can reduce 
the time-complexity by calculating all the factor-to-variable messages that leave a particular factor at 
the same time {e.g., section 5.2). We call this type of synchronized update, factor-synchronized 
(f-sync) update. 
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2.1.2 The limits of message passing 

By observing the application of distributive law in semirings, a natural question to ask is: can we use 

distributive law for polynomial time inference on loop-free graphical models over any of the inference 

problems at higher levels of inference hierarchy or in general any inference problem with more than one 

marginalization operation? The answer to this question is further motivated by the fact that, when loops 

exists, the same scheme may become a powerful approximation technique. When we have more than 

one marginalization operations, a natural assumption in using distributive law is that the expansion 

operation distributes over all the marginalization operations - e.g., as in min-max-sum (where sum 

distributes over both min and max), min-max-min, xor-or-and. Consider the simplest case with three 
12 12 
operators ©, © and ©, where © distributes over both © and ©. Here the integration problem is 

qW = 0, 0, (8)fi(s) 

-J2 -Jl I 

where Ji and J 2 partition N}. 

1 2 

In order to apply distributive law for each pair (©,©) and (©,©), we need to be able to commute 
1 2 

© and © operations. That is, we require 

12 21 

0 0 s(Taub) = 0 0 g(TAuB)- (2-14) 

for the specified A C Ji and B C J 2 . 

Now, consider a simple case involving two binary variables Xi and Xj, where is 



0 

1 

0 

a 

b 

1 

c 

d 


Applying equation (2.14) to this simple case (i.e., A = {i},B = {j}), we require 


2 , 1 
o 


= (a 


1 , 2 
a 


The following theorem leads immediately to a negative result: 

Theorem 2.1.1. [91]: 


121 


212 


1 


2 
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which implies that direct application of distributive law to tractably and exactly solve any inference 
problem with more than one marginalization operation is unfeasible, even for tree structures. This 
limitation was previously known for marginal MAP inference [241]. 

Min and max operations have an interesting property in this regard. Similar to any other operations 
for min and max we have 


minmaxg(xjyj) 7 ^ maxming(xjyj) 

Xj Xj Xj Xj 

However, if we slightly change the inference problem (from pure assignments Xj^ G to a 
distribution over assignments; a.k.a. mixed strategies), as a result of the celebrated minimax theorem 
[300], the min and max operations commute - i.e., 

miiiinax Vs(xj)g(xiuj)s(^i) = max rnin V s(^i)g(xiuj)s(xjJ 

s(x j) s(Xi) ^ s(Xi) s(Xj) ^ 

ilUJ iluj 

where s(a:jJ and s(xj 2 ) are mixed strategies. This property has enabled addressing problems with 
min and max marginalization operations using message-passing-like procedures. For example, Ibrahimi 
et al. [140] solve this (mixed-strategy) variation of min-max-product inference. Message passing pro¬ 
cedures that operate on graphical models for game theory (a.k.a. “graphical games”) also rely on this 
property [161, 232], 

2.2 Tractable factors 

The applicability of graphical models to discrete optimization problems is limited by the size and 
number of factors in the factor-graph. In section 2.2.1 we review some of the large order factors that 
allow efficient message passing, focusing on the sparse factors used in part II to solve combinatorial 
problems. In section 2.2.2 we introduce an augmentation procedure similar to cutting plane method to 
deal with large number of “constraint” factors. 

2.2.1 Sparse factors 

The factor-graph formulation of many interesting combinatorial problems involves sparse (high-order) 
factors. Here, either the factor involves a large number of variables, or the variable domains, Aj, have 
large cardinality. In all such factors, we are able to significantly reduce the 0(jA[j) time complexity of 
calculating factor-to-variable messages. Efficient message passing over such factors is studied by several 
works in the context of sum-product and min-sum inference classes [123, 249, 269, 291, 292]. Here we 
confine our discussion to some of the factors used in part H. 

The application of such sparse factors are common in vision. Many image labelling solutions to 
problems such as image segmentation and stereo reconstruction, operate using priors that enforce sim- 
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ilarity of neighbouring pixels. The image processing task is then usually reduced to finding the MAP 
solution. However pairwise potentials are insufficient for capturing the statistics of natural images and 
therefore higher-order-factors have been employed [168—170, 174, 183, 234, 268]. 

The simplest form of sparse factor in combinatorial applications is the Potts factor, f|j j|(a:i, Xj) = 
l(xj = Xj). This factor assumes the same domain for all the variables {Xi = Xj and its tabular 

form is non-zero only across the diagonal. It is easy to see that this allows the marginalization of 
equation (2.3) to be performed in Odffjl) rather than 0(1^)! \Xj\). Another factor of similar form is the 
inverse Potts factor, f|j = l(xj ^ Xj), which ensures Xi / Xj. In fact any pair-wise factor that 

is a constant plus a band-limited matrix allows 0(1^)!) inference {e.g., factors used for bottleneck 
TSP in section 5.2.2). 

Another class of sparse factors is the class of cardinality factors, where Xi = {0,1} and the factor 
is defined based on only the number of non-zero values -i.e., fi(xj) = g{Yliedi^i)- Gail et al. [105] 
proposes a simple 0(|dl| K) method for f(g:i) = l((X]jgaiTj) = K). We refer to this factor as K-of-N 
factor and use similar algorithms for at-least-K-of-N fi(xi) = a^i) > K) and at-most-K-of-N 

fi(xi) = l((I]jggiTj) < K) factors. 

An alternative is the linear clique potentials of Potetz and Lee [249]. The authors propose a 
0(|dl| lAip) (assuming all variables have the same domain Xi) marginalization scheme for a general 
family of factors, called linear clique potentials, where fi(xi) = ^ nonlinear gi(.). For 

sparse factors with larger non-zero values (i.e., larger k), more efficient methods evaluate the sum of 
pairs of variables using auxiliary variables forming a binary tree and use the Fast Fourier Transform to 
reduce the complexity of K-of-N factors to 0(|i9I| log(|(9I|)^) (see [292] and references in there). 

Here for completeness we provide a brief description of efficient message passing through at-least- 
K-of-N factors for sum-product and min-sum inference. 

K of N factors for sum-product 

Since variables are binary, it is convenient to assume all variable-to-factor messages are normalized such 
that pj_s.i(0) = 1. Now we calculate pi^i(O) and pi_^i(l) for at-least-K-of-N factors, and then normalize 
them such that pi_^i(0) = 1. 

In deriving pi_,.j(0), we should assume that at least K other variables that are adjacent to the factor 
fi are nonzero and extensively use the assumption that pj_^/(0) = 1. The factor-to-variable message of 
equation (2.6) becomes 

Pi^i(O) = ^ir(^Xj)>A:^ Pj^i(xj) 

^\i ^ jedl\i ^ jedl\i 

ACdl\i, |A|>A isA 


(2.15) 
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where the summation is over all subsets A of 91 \ i that have at least K members. 

Then, to calculate pi_>.j(l) we follow the same procedure, except that here the factor is replaced by 


1 Thi, i. because here we aeeume a. = 1 and therefore it ie euffieient tor A' - 1 

other variables to be nonzero. 

Note that in equation (2.15), the sum iterates over “all” A C 9I\i of size at least K. For high-order 
factors fi (where |I| is large), this summation contains an exponential number of terms. Fortunately, 
we can use dynamic programming to perform this update in 0(|9I| K). The basis for the recursion of 
dynamic programming is that, starting from B = I \ i, a variable C can be either zero or one 

Ae{KCB,|K|>fc}ieA 

Ae{KCB\fc, |K|>A} jeA \Ae{KCB\fc, |A|>A-l}ieA 

where each summation on the r.h.s. can be further decomposed using similar recursion. Here, dynamic 
program reuses these terms so that each is calculated only once. 


K of N factors for min-sum 

Here again, it is more convenient to work with normalized variable-to-factor messages such that pj_>i(0) = 
1=0. Moreover in computing the factor-to-variable message pi_^j(xi) we also normalize it such that 
pi^i(O) = 0. Recall the objective is to calculate 


Pl-^i(Xi) 


= min 1 

5\i 


/ ^ 

ieai 


Xi = 




for Xi = 0 and Xj = 1. 

We can assume the constraint factor is satisfied, since if it is violated, the identity function evaluates 
to -|-oo (see equation (1.20)). For the hrst case, where xt = 0, K out of |9I \ i\ neighbouring variables 
to factor I should be non-zero (because l((X]je9i^i) ~ ~ ®)- minimum is obtained if 

we assume the neighbouring variables with smallest pj_>i(l) are non-zero and the rest are zero. For 
Xj = 1, only AT — 1 of the remaining neighbouring variables need to be non-zero and therefore we need 
to find K — 1 smallest of incoming messages (pj^i(l) Vj G 91 \ i) as the rest of messages are zero due 
to normalization. 

By setting the pi^i(O) = 0, and letting A{K) C dl\ i identify the set of K smallest incoming 
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messages to factor I, the pi_ 5 .j(l) is given by 

Pi^i(l) = ( Pi^i(l)) = Pi^^i(l) 

^ieA(A) ^ ^jeA(K-l) ^ 

where is the index of smallest incoming message to I, excluding pj_>i(l). A similar procedure 
can give us the at-least-K-of-N and at-most-K-of-N factor-to-variable updates. 

If K is small (i.e., a constant) we can obtain the smallest incoming message in 0{K |9I|) time, 
and if K is in the order of |cll| this requires 0(|(9I| log(|cII|)) computations. For both min-sum and 
sum-product, we incur negligible additional cost by calculating “all” the outgoing messages from factor 
I simultaneously (he., f-sync update). 

2.2.2 Large number of constraint factors 

We consider a scenario where an (exponentially) large number of factors represent hard constraints 
(see definition 1.3.2) and ask whether it is possible to find a feasible solution by considering only 
a small fraction of these constraints. The idea is to start from a graphical model corresponding to a 
computationally tractable subset of constraints, and after obtaining a solution for a sub-set of constraints 
(e.^., using min-sum BP), augment the model with the set of constraints that are violated in the current 
solution. This process is repeated in the hope that we might arrive at a solution that does not violate 
any of the constraints, before augmenting the model with “all” the constraints. Although this is not 
theoretically guaranteed to work, experimental results suggest this can be very efficient in practice. 

This general idea has been extensively studied under the term cutting plane methods in different 
settings. Dantzig et al. [77] first investigated this idea in the context of TSP and Gomory et al. 
[118] provided a elegant method to identify violated constraints in the context of finding integral 
solutions to linear programs (LP). It has since been used to also solve a variety of nonlinear optimization 
problems. In the context of graphical models, Sontag and Jaakkola [284] (also [286]) use cutting plane 
method to iteratively tighten the marginal polytope - that enforces the local consistency of marginals; 
see section 2.3 - in order to improve the variational approximation. Here, we are interested in the 
augmentation process that changes the factor-graph (i.e., the inference problem) rather than improving 
the approximation of inference. 

The requirements of the cutting plane method are availability of an optimal solver - often an LP 
solver - and a procedure to identify the violated constraints. Moreover, they operate in real domain 
W^; hence the term “plane”. However, message passing can be much faster than LP in finding ap¬ 
proximate MAP assignments for structured optimization problems [325]. This further motivates using 
augmentation in the context of message passing. 

In sections 4.5 and 5.2, we use this procedure to approximately solve TSP and graph-partitioning 
respectively. Despite losing the guarantees that make cutting plane method very powerful, augmentative 
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message passing has several advantages: First, message passing is highly parallelizable. Moreover by 
directly obtaining integral solutions, it is much easier to find violated constraints. (Note the cutting 
plane method for combinatorial problems operates on fractional solutions, whose rounding may eliminate 
its guarantees.) However, due to non-integral assignments, cutting plane methods require sophisticated 
tricks to find violations. For example, see [12] for application of cutting plane to TSP. 

2.3 Inference as optimization 

The variational approach is concerned with probabilities, and therefore this section is limited to opera¬ 
tions on real domain. In the variational approach, sum-product inference is expressed as 

p = arg^min D(p | p^) (2.16) 

where D(p | p^) is the KL-divergence between our approximation p and the true distribution p at 
inverse temperature (3 (see example 1.1.4). Here p is formulated in terms of desired marginals. 

Expanding the definition of KL-divergence and substituting p from equation (1.18), equation (2.16) 
becomes 


p = arg^min ^ p(x) log(p(x)) - 

X 

arg^ min E p(x)log(p(x)) - 

X 

where we have removed the log partition function log(q(0, /3)) = log Hi from equation (2.17) 

because it does not depend on p. This means that the minimum of equation (2.18) is log(q(0, /3)), which 
appears when D(p | p^) = 0 - ie., p = p^. 

The quantity being minimized in equation (2.18), known as variational free energy, has two 
terms: the (expected) energy term 

U(p,p) -^p(x) (^log(fi(xi)) 

X \ I 

and the entropy term 

H(p) = - J]] p(t) log(p(^)) . 

X 

Different families of representations for p(.) (in terms of its marginals) produces different inference 
procedures such as BP, Generalized BP and Mean-field method [303]. 

Max-product (or min-sum) inference is retrieved as zero-temperature limit of sum-product infer- 


P X] log(p(x)) = (2.17) 

X 

(^^^°s(fi(^i))^ (2.18) 
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ence: 


p(.) = lim argsmin/3U(p,p) - H(p) 

/3^>+oo ^ 

= arg-min-^p(x)(^log(fi(xi))) (2.19) 

X I 

where the energy term is linear in p and therefore the optima will be at a corner of probability simplex, 
reproducing the MAP solution. 

Here by defining ^(x) we get the min-sum form 

P(-) = arg^min^p(xi)(^log(f((xi)) 

X I 

We observe that using a second parameter a, min-max inference is also retrievable^ 

X I 

= arg^ min E p(xi)(maxlog(fi(xj))) 

X 

where again due to the linearity of the objective in p, the optima are at the extreme points of the 
probability simplex. 

To retrieve sum-product BP update equations from divergence minimization of equation (2.16), we 
will reparameterize p using its marginals p(xj) and p(xj). Here, we present this reparametrization 
in a more general form, as it holds for a any commutative semiring where (T*, ®) is an abelian group. 

Proposition 2.3.1. If the (8> operator of the semiring has an inverse and the factor-graph is loop-free, 
we can write p(x) as 


( 2 . 20 ) 


p(x) 


0iPfe) 

0i(p(xi)O(|9f| - 1)) 


( 2 . 21 ) 


d&f 

where the inverse is w.r.t ® and the exponentiation operator is defined as aQb = a ■ y 'Si a^ 

b times 

Proof. For this proof we use the exactness of BP on trees and substitute BP marginals equation (2.9) 


^Here we assume there are no ties at the min-max solution i.e., fifxj) > fj(xj) VJ 7^ I. 
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into equation (2.21): 


- 1 )) 

= Pix) 

where we substituted the variable-to-factor messages in the denominator with factor-to-variable mes- 

sages according to equation (2.2) and used the definition of inverse {i.e., a® a ^ = 1) to cancel out the 
denominator. □ 

Intuitively, the denominator is simply cancelling the double counts - that is since p(xj) is counted 
once for any I G in the nominator, the denominator removes all but one of them. 


0lPfe) 

0i(p(a;*)o(|5*l -1)) 

( 01gai Pi-5-l(Ti)) 


2.3.1 Sum-product BP and friends 

Rewriting equation (2.21) for sum-product ring p(3:) = replacing p in the variational 

energy minimization, we get 

p = arg^min /3 ^ p(xi)fi(xi) 

I 

EE P(^i) log(p(xi)) j - (E(i-I«I)E p(xi)log(p(xi)) 

I J \ i Xi 

such that p(tj) = p(xj) Vi, I G di 

x:\^ 

X] ^ 

Xi 

where the energy term equation (2.22) is exact and the quantity that is minimized is known as Bethe 
free energy [30, 328]. The constraints equations (2.24) and (2.25) ensure that marginals are consistent 
and sum to one. Following the lead of Yedidia et al. [328], Heskes [132] showed that stable fixed points 
of sum-product BP are local optima of Bethe free energy. 

The optimization above approximates the KL-divergence minimization of equation (2.18) in two 
ways: (I) While the marginal constraint ensure local consistency, for general factor-graphs there is no 
guarantee that even a joint probability p with such marginals exists (i.e., local consistency conditions 
outer-bound marginal polytope; the polytope of marginals realizable by a join probability p(t)). 
(II) Bethe entropy is not exact for loopy factor-graphs. Using the method of Lagrange multipliers to 
enforce the local consistency constraints and setting the derivatives of equation (2.23) w.r.t. p to zero, 
recovers sum-product BP updates [327, 328]. This optimization view of inference has inspired many 


( 2 . 22 ) 

(2.23) 

(2.24) 

(2.25) 
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sum-product inference techniques with convex entropy approximations and convergence guarantees [113, 
126, 134, 206, 293, 304, 319, 329]. 

2.3.2 Min-sum message passing and LP relaxation 

LP relaxation of min-sum problem seeks marginals p( 2 ;i)VI 

p = arg^min ^p(xi)fi(xi) 

I 

such that pfe) = p(aJi) Vi, I G di 

^\i 

Xi 

If integral (i.e., p(xi) = l{xi = x*) for some x* = {x^,..., x*j^}), this LP solution is guaranteed to be 
optimal (i.e., identical to equation (2.19)). Taking the zero temperature limit (hm/3 —)■ oo) of the Bethe 
free energy of equations (2.22) and (2.23), for any convex entropy approximation [126, 133, 304, 305], 
ensures that sum-product message passing solution recovers the Linear Programming (LP) solution [311]. 
Moreover, replacing the summation with maximization (which again corresponds to temperature limit) 
in the resulting convex message passing, produces the convex min-sum message passing, which agrees 
with LP relaxations, under some conditions {e.g., when there are no ties in beliefs). The general interest 
in recovering LP solutions by message passing is to retain its optimality guarantees while benefiting from 
the speed and scalability of message passing that stems from exploitation of graphical structure [324]. 
One may also interpret some of these convex variations as replicating variables and factors while keeping 
the corresponding messages identical over the replicates [271]. After obtaining message updates, the 
number of replicates are allowed to take rational values (Paris! introduced a similar trick for estimation 
of the partition function using replica trick [55, 211]). 

Another notable variation for approximate MAP inference is max-product-linear-program , which 
performs block coordinate descend in the space of duals for equation (2.26). MPLP is guaranteed to 
converge and is often able to recover LP solution [114, 285]. Finally dual (and primal) decom¬ 
position methods minimize factors separately and combine their estimates in a way that agrees with 
sub-gradient in each iteration [42, 155, 175, 176]. 

2.3.3 Min-max and other families 

By rephrasing the variational min-max inference of equation (2.20) 

p = arg^ min E p(xi)(maxlog(fi(xi))) 


(2.26) 

(2.27) 
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in terms of marginals p(xj) and enforcing marginal consistency constraints, we obtain the following LP 
relaxation 


p = arg-min y (2.28) 

such that E p(®i)fi(Ti) <y VI 

^p(m) = p(xi) Vi,l e di 

X\i 

X] 1 

Xi 

which surprisingly resembles our sum-product reduction of min-max inference in equation (1.25). Here 
p(xj)fi(xj) < y is a relaxation of our sum-product factor l(fi(xj) < y) in equation (1.27). 

Claim 2.3.2. y in equation (2.28) lower bounds the min-max objective y*. Moreover, ifp is integral, 
then y = y* and x* is the optimal min-max assignment. 

Proof. The integral solution p, corresponds to the following optimization problem 

x* = arg^, min y (2.29) 

such that fl(iLi) < U 

= arg 3 ,min maxfi(xj) 

which is the exact min-max inference objective. Therefore, for integral p, we obtain optimal min-max 
solution. On the other hand by relaxing the integrality constraint, because of the optimality guarantee 
of LP, the LP solution y can not be worse than the integral solution and its corresponding value y*. □ 

This lower bound complements the upper bound that we obtain using a combination of sum-product 
reduction and an incomplete solver (such as perturbed BP of section 2.6.3) and can be used to assess 
the optimality of a min-max solution. 

The only other extensively studied inference problem in the inference hierarchy of section 1.2 is max- 
sum-product (a.k.a. marginal MAP) inference [80, 152, 202]. In particular variational formulation 
of max-sum-product inference [195], substitutes the entropy term in equation (2.17) with conditional 
entropy. 

2.4 Loop corrections 

In this section we first review the region-based methods that account for short loops in section 2.4.1 
and then show how to perform loop correction by taking into account the message dependencies in 
section 2.4.2. In section 2.4.3 we introduce a loop correction method that can benefit from both types 
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of loop corrections, producing more accurate marginals. While the region-based techniques can be used 
to directly estimate the integral, the approximation techniques that take message dependencies into 
account are only applied for estimation of marginals. 

2.4.1 Short loops 

We consider a general class of methods that improve inference in a loopy graphical model by performing 
exact inference over regions that contain small loops. 

The earliest of such methods is junction-tree [148, 186], which performs exact inference with 
computation cost that grows exponentially in the size of largest region -i.e., tree width [57]. Here, 
regions form a tree and the messages are passed over regional intersections. While this algorithm is still 
popular in applications that involve certain class of graphs [35] or when the exact result is required, 
most graphs do not have a low tree width [157, 167]. 

An extension to junction tree is the junction graph method [6] that removes the requirement for 
the regions to form a tree. For this, the proxy between two regions is a subset of their intersection 
(rather than the whole intersection) and one still requires the regions that contain a particular variable 
to form a tree. Similar ideas are discussed under the name of cluster graphs in [171]. 

Inspired by the connection between Bethe free energy and belief propagation (see section 2.3), 
Yedidia et al. [328] proposed Generalized BP that minimizes Kikuchi approximation to free energy 
(a.k.a. Cluster Variational Method [165, 246]). Here the entropy approximation is obtained from a 
region-graph. 

A region p is a collection of connected variables V(p) and a set of factors J-{p) such that each 
participating factor depends only on the variables included in the region. To build the CVM region- 
graph"^, one starts with predefined top (or outer) regions such that each factor is included in at least 
one region. Then, we add the intersection of two regions (including variables and factors) recursively 
until no more sub(inner)-region can be added. Each region is then connected to its immediate parent. 
A region-graph, reparameterizes p(.) in terms of its marginals over the regions 

p(t) = 

p 

where c(p) is the counting number for region p and ensures that each variable and factor is counted 
only once. This number is recursively defined by Mobius formula for inner regions: 

c(p) = 1 - 

p'Dp 

where p' is an ancestors of p in the region graph. 

^Here we are making a distinction between a general region graph and a CVM region-graph. 

® More accurately V{p) C V{p'). 
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Similar to BP, by substituting the reparametrization of equation (2.30) into the variational free 
energy minimization of equation (2.18) we get 

p = arg^min p(^v(p)) ( Y - P(®v(p))log(p(Tv(p))) 

p \^v{p) \ leJ^(p) 

s.t. J]] p(xv(p')) = P(^v(p)) VpCp (2.32) 

5\V(p) 

which is known as Kikuchi approximation to free energy [165]. The constraints of equation (2.32) 
ensure that marginals are consistent across overlapping regions. Solving this constraint optimization 
using the method of Lagrange multipliers, yields a set of recursive equations that are known as Gen¬ 
eralized BP equations [328]. Again a region-based approximation is exact only if the region-graph has 
no loops. 

A region-graph without restrictions on the choice of regions generalizes junction-graph method as 
well. The general construction of the region graph only requires that the counting numbers of all the 
regions to which a variable (or a factor) belong, sum to 1 [327]. For different criteria on the choice of 
regions see also [235, 314, 317]. 

2.4.2 Long loops 

In graphical models with long-range correlation between variables, region-based methods are insufficient. 
This is because their complexity grows exponentially with the number of variables in each region and 
therefore they are necessarily inefficient account for long loops in the graph. 

A class of methods for reducing long-range correlations are methods based on cut-set conditioning 
[245], where a subset of variables are clamped, so as to remove the long-range correlations that are formed 
through the paths that include the cut-set. For example, consider a Markov network in the form of 
a cycle. By fixing any single variable, the reduced factor graph becomes a tree and therefore allows 
exact inference. Several works investigate more sophisticated ideas in performing better inference by 
clamping a subset and the resulting theoretical guarantees [78, 89, 312], A closely related idea is Rao- 
Blackwellization (a.k.a. collapsed MCMC; see section 2.6.1), a hybrid approach to inference [108] 
where particles represent a partial assignment of variables and inference over the rest of variables is 
performed using a deterministic method. The deterministic inference method such as BP is used to 
calculate the value of the partition function, for each possible joint assignment of the variables that are 
not collapsed. Then collapsed particles are sampled accordingly. This process in its general form is very 
expensive, but one could reduce the cost, depending on the structure of the network [32]. 

The loop calculus of Chertkov [60] [61] expands the free energy around the Bethe approximation, 
with one term per each so-called generalized loop in the graph. Since the number of loops grows 
exponentially in the number of variables, this expansion does not provide a practical solution. Some 
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attempts have been made to to make this method more practical by truncating the loop series [117]. 
While the original loop series was proposed for binary valued and pairwise factors, it has been generalized 
to arbitrary factor-graphs [315, 320] and even region-graphs [334], 

Another class of approximate inference methods perform loop correction by estimating the message 
dependencies in a graphical model [217, 220, 262]. These methods are particularly interesting as they 
directly compensate for the violated assumption of BP - i.e., corresponding to independent set of 
incoming messages. 

For the benefit of clarity, we confine the loop correction equations in this section and its generalization 
in the next section to Markov networks (i.e., jcllj = 2); see [254] for our factor-graph versions. Although 
the previous works on loop corrections have been only concerned with sum-product inference, here we 
present loop corrections for a general commutative semiring (T*,©,®) in which the operation © has 
an inverse (i.e., (T*,®) is a group). In particular this means these loop corrections may be used for 
min-sum class of inference. 

Here we rewrite BP update equations (2.2) and (2.3) for Markov networks^ 

Pi^j{Xi) (X © ® ^Pk^iixk) (2.33) 

\xi kGAi\j 

p{xi) oc ©® kk,i}ixk^i)Pk^i{xk) (2.34) 

\xi kGAi 

Figure 2.2(left) shows the BP messages on a part of Markov network. Here if the Markov network 
is a tree, HP’s assumption that Ps^i, Po->-i and Pn^i are independent is valid, because these messages 
summarize the effect of separate sub-trees on the node i. However if the graph has loops, then we use 
h(x^j\^) to denote message dependencies. If we had access to this function, we could easily change 
the BP message update of equation (2.33) to 

Pi^jixi) oc h(^AAi)® © ® kk,i}{X{k,i}) ^Pk^iiXk) 

\xi k£Ai\j 

Since it is not clear how to estimate h(3iAi\j)) follow a different path, and instead estimate the 
so-called cavity distribution, which we denote by h(xAj), which is simply the joint marginal over the 
Markov blanket after making a cavity - i.e., removing a variable i and its neighbouring factors fj VI G di. 
However, since the missing information is the “dependence” between the messages, h(a;;^(j)) has a degree 
of freedom w.r.t. individual marginals - i.e., it can be inaccurate by a factor of 0jeA(i) 
some gjVj, without affecting the loop-corrected message passing procedure. This degree of freedom is 
essentially equivalent to the freedom in initialization of BP messages. In the following, we show the 
resulting loop-corrected message passing. But first we write BP updates in a different form. 

^ Note that is over Xt rather the conventional way of dehning it on Xj. This formulation is the same as original 

BP equation for Markov network if the graph does not have a loop of size two. 
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Figure 2.2: (left) BP messages on a Markov network and the ideal way dependencies should be taken into 
account, (right) BP marginal over extended Markov blanket Vj for node j and the message dependencies over 
the Markov blanket Ai for node i. 


Define the extended Markov blanket Vf = Af U {i} be the Markov blanket of i plus i itself, see 
figure 2.2 (right). We can write BP marginals over Vi 


P(TVi) OC 

0 ® 

k^Ai 

(2.35) 

Using this, equation (2.33) simplifies to: 



Pi^j{Xi) 

oc 0P(TvJ/f{*j}(3f{ij}) 

\a:i 

OC 0p(^vi) 

(2.36) 

P(Tfc) 

(2.37) 


\xi 


Now assume we are given the dependency between the messages Pk^i{xk) ffk G Ai in the form of 
h(xAi)- This means we can re-express equation (2.35) as 

Pfevi) oc h(TAi) f{j,fc}(Tp^fc}) (8) Pfc^i(xfc) (2.38) 

keAi 

By enforcing the marginal consistency of p(xyj) and p(xyj) over Xi (and Xj) 

0 p(^Vi)A{ij}(a^o2;j) = 0 p(^vi)A{ij}(a^o4;i) 

-\i,3 -\i,j 

we retrieve a message update similar to that of BP (in equation (2.33)) that incorporates the dependency 








CHAPTER 2. APPROXIMATE INFERENCE 


52 


between BP messages 


At+l) 

i^j 


{Xi) oc 




(2.39) 


It is easy to verify that this update reduces to BP updates (equation (2.33)) when h(®gj) is uniform 
- that is we do not have any dependency between messages. The loop-correction method of Mooij 
et al. [ 220 ] is similar, however this interpretation does not apply to their updates for factor graphs. 
We extend the same idea to perform loop-correction for overlapping regions of connected variables in 
section 2.4.3 where we pass the messages from one region to the outer boundary of another region. 

The main computational cost in these loop correction methods is estimating h(;r^j), the message 
dependencies. We use clamping to perform this task. For this we remove Xi and all the immediately 
depending factors from the graph. Then we approximate the marginal h(x^j) by reduction to integra¬ 
tion; see section 1.4.1. Note that the h(x^j) obtained this way contains not only dependencies but also 
the individual marginals in the absence of node i (h(xj) Vj G Ai). However since the messages updates 
for pj^i{xj) Vj G Ai, perform “corrections” to this joint probability, we do not need to divide h(x^j) 
by the individual marginals. 


2.4.3 Both message dependencies and short loops 

Section 2.4.1 presented loop correction methods that improve loopy BP by considering interactions 
within small clusters of variables, thus taking small loops within these clusters into account. The 
previous section showed how to account for dependency between BP messages - thus taking long-range 
correlations into account. In this section we introduce a generalization that performs both types of loop 
correction. 

The basic idea is to form regions, and perform exact inference over regions, to take short loops 
into account. However in performing message passing between these regions, we introduce a method to 
perform loop correction over these messages. 

We start by defining a region p = {i,... ,1} as a set of connected variables. Note that this definition 
is different from definition of region for region-based methods as it only specifies the set of variables, 
and not factors. Let Ap = {i € Aj, i ^ p \ j € p} he the Markov blanket of region p, and as before let 
Vp = p U Ap. 

Region pi is a neighbour of p 2 with neighbourhood pi :2 iff pi -2 (Api)np 2 7 ^ 0 - i.e., the Markov 
blanket of pi intersects with p 2 (note that pi :2 is different from p 2 :i). The messages are exchanged on 
these neighbourhoods and pi^ 2 {Xp^,^) is a message from region pi to p 2 - 

Example 2.4.1. Figure 2.3 shows a set of neighbouring regions (indexed by 1,2,3,4,5,6 and 7). 
Here the region pi receives “overlapping” messages from four other regions. For example the message 
?Q-^i{xd,i) overlaps with the message p 5 _s.i(x^g) as well as p 7 _^i(x^ J. Therefore simply writing p(Tvp) 
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Figure 2.3: (top) seven regions pi, ... ,p 7 and the domain of messages sent from each region to pi. Here p 5 :i is 
the domain of message from region 5 to region 1. (bottom) the message region-graph shows how these overlapping 
messages are combined to prevent double-counts. 

in terms of the factors inside Vp and the incoming messages (as in equation (2.5)) will double-count 
some variables. 

Message region-graphs 

Here, similar to section 2.4.1, we construct a region-graph to track the double-counts. However, here we 
have one message-region-graph per region p. The construction is similar to that of cluster variational 
methods; we start with the original message-domains (e.g., p 2 -.i) and recursively add the intersections, 
until no more message-region 7 can be added. Each message-region 7 is connected to its immediate 
parent, figure 2.3 shows the two-layered message-region-graph for region pi. Here, for discussions 
around a particular message-region-graph we will drop the region-index 1 . 

Let m(;c..^) = p7r-s.i(Xp^.^) Vtt be the top regions in the region-graph, consisting of all the incoming 
messages to region of interest pi. The Mobius formula (equation (2.31)) gives counting number for 
message-region (here again the top regions’ counting number is one). A downward pass, starting 
from top regions, calculates the belief m(x.^) over each message-region, as the average of beliefs over its 
parents. 
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Example 2.4.2. In figure 2.3 the belief m(xe) is the average of beliefs over m(x|^gj) = P 5 ->.i(Tpg.J 
and when marginalized over Xg. Here the counting number of Xg is c({e}) = 

1-(1 + 1 ) = - 1 . 

def 

We require the ® operator of the semiring to have an inverse. Recall the power operator © yQk = 
y ® ® y. For sum-product and min-sum semirings, this operator corresponds to exponentiation and 

'-V-" 

k times 

product respectively and it is well-dehned also for rational numbers k. Now define the average as 

avg{{yi,...,yk}) (0^*)©^ (2-40) 

i 

Using Pa (7) to denote the parents of region 7, in the downward pass 

m(x^)oc (g) avg(0m(7')) (2.41) 

7'ePa(7) 

where m(x..^) for top regions are just the incoming messages. 

Let fA(xA) = 0ICA ^ 1 ( 4 ^ 1 ) be the semiring-product of all the factors defined over a subset of A. For 
example in hgure 2.3, fvpa(®Vpa) ^^e product of 9 pairwise factors (ie., all the edges in the figure). 
After a downward pass, the belief over Vp (analogous to equation (2.35)): 

p(Tvp) ot fvp(®vp)© f 

that is p(xyp) is the semiring product of all the factors inside this region and all the beliefs over message- 
regions inside its message-region-graph, where double counts are taken into account. For our example, 
assuming sum-product ring, 

P(®VpJ = fvp.(svpjm(x4_5)... m(x9^4) ( m(x4)“^... m(x9)“^ 


where the semiring-product and inverse are product and division on real domain. 

At this point we can also introduce an estimate for message dependencies h(xgp) into the equation 
above, and generalize the update of equation (2.39) to 


PUa (4^P,:, 


) OC 


0W P(^Vpi,)/fVpanVp6(TVpanVp6) ..X 
m ') /f ~ UJ I ® P6-5-a(4ip6:a) 

P t^Vpa ) / WpaH Vpi, {Xy pf, ) 


(2.43) 


One last issue to resolve is to define the “effective” message P^}^g^{xp^.^), which is different from 
Pb^ai^pba^' bid not directly use p[!^q(Xpj^.^) in the previous iteration, we should not include it 

directly in this update. Instead we use the message region-graph for region a to calculate the effective 
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message: 


h^aixp^j = ® (m(x.^) 0 c( 7 )) (2.44) 

iQpb-.a 

The effective message, as defined above (equation (2.44)), can be efficiently calculated in an upward 
pass in the message region-graph. Starting from the parents of the lowest regions, update the belief 
m(x.y) obtained in downward pass equation (2.41) using the new beliefs over its children: 


m(x^) 


m(T^) 

7'6Ch(7) 



(2.45) 


where Ch( 7 ) is the set of children of 7 in the message region-graph. After the upward pass, the new 
beliefs over top regions gives us the effective messages 




(2.46) 


Example 2.4.3. In our example of hgure 2.3(bottom), assuming a sum-product ring, since the message- 
region-graph only has two layers, we can write the effective message ph^aix^ 4 ^g}) as 


P/i—>-a(T{4,9}) ~ P/i—^-a (T{4,9} ) 


m(x4)m(x9) 

(Sx4 P^^'a(ili{4,9}))(X]x9 {4,9})) 


(2.47) 


This form of loop correction, which we call Generalized Loop Correction (GLC), generalizes both 
correction schemes of sections 2.4.1 and 2.4.2. The following theorem makes this relation to generalized 
BP more explicit. 


Theorem 2.4.1. For Markov networks, if the regions {p} partition the variables, then any Generalized 
BP fixed point of a particular region-graph construction is also a fixed point for GLC, when using uniform 
message dependencies (i.e., h(x^J oc 1 


2.4.4 Experiments 

This section compares different variations of our generalized loop correction method (GLC) for sum- 
product ring against BP as well as Cluster Variational Method (CVM; section 2.4.1), loop correction 
of [220] (LCBP), which does not exactly account for short loops and the Tree Expectation Propagation 
(TreeEP) [214] method, which also performs some kind of loop correction. For CVM, we use the double¬ 
loop algorithm [133], which is slower than Generalized BP but has better convergence properties.® We 

®See [254] for the proof. 

®A11 methods are applied without any damping. We stop each method after a maximum of 1E4 iterations or if the 
change in the probability distribution (or messages) is less than lE-9. 
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report the time in seconds and the error for each method as the average of absolute error in single 
variable marginals - i.e., ^ I P(^i) “ P(^*) I- each setting, we report the average results 

over 10 random instances of the problem. We experimented with grids and 3-regular random graphs.^ 
Both LCBP and GLC can be used without any information on message dependencies, with an 
initial cavity distribution estimated via clamping cavity variables. In the experiments, full means 
message dependencies h was estimated while uniform means h was set to uniform distribution (ie., loop- 
correction was over the regions only). We use GLC to denote the case where the regions were selected 
such that they have no overlap (i.e., Pa^ Pb = ^ ya,b) and GLC-I- when overlapping clusters of some 
form are used. For example, GLC+(Loop4, full) refers to a setting with message dependencies that 
contains all overlapping loop clusters of length up to 4. If a factor does not appear in any loops, it 
forms its own cluster. The same form of clusters are used for CVM. 

Grids 

We experimented with periodic spin-glass Ising grids of example 1.1.4. In general, smaller local fields 
and larger variable interactions result in more difficult problems. We sampled local fields independently 
from A/'(0,1) and interactions from A/'(0,/3^). Figure 2.4a summarizes the results for 6x6 grids for 
different values of j3. 

We also experimented with periodic grids of different sizes, generated by sampling all factor entries 
independently from A/'(0,1). Figure 2.5a compares the computation time and error of different methods 
for grids of sizes that range from 4x4 to 10x10. 

Regular Graphs 

We generated two sets of experiments with random 3-regular graphs (all nodes have degree 3) over 40 
variables. Here we used Ising model when both local fields and couplings are independently sampled 
from AA(0,/?^). Figure 2.4b shows the time and error for different values of /3. Figure 2.5b shows time 
versus error for graph size between 10 to 100 nodes for (3 = 1. 

Our results suggest that by taking both long and short loops into account we can significantly 
improve the accuracy of inference at the cost of more computation time. In fact both plots in figure 2.5 
show a log-log trend in time versus accuracy which suggests that taking short and long loops into 
account has almost independently improved the quality of inference. 

2.5 Survey Propagation: semirings on semirings 

Survey propagation (SP) was first introduced as a message passing solution to satisfiability [48] and was 
later generalized to general GSP [47] and arbitrary inference problems over factor-graphs [208]. Several 


^The evaluations are based on implementation in libdai inference toolbox [221]. 
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- ■ GLC+ (Loop4, Uniform) TreeEP 



• • BP 

- GLC+ (Loops, Full) 

- ■ GLC+ (Loop7, Full) 

- - CVM (Loops) 


• ■ CVM (Loop?) 
— LCBP(Full) 
TreeEP 


(a) spin-glass Ising grid 


(b) spin-glass Ising model on a 3-regular graph 


Figure 2.J^: Average Run-time and accuracy for 6x6 spinglass Ising grids and 3-regular Ising model for different 
values of p. Variable interactions are sampled from Af{0, /3'^) and local fields are sampled from JV{0,1). 
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(b) spin-glass Ising model on 3-regular graph 


Figure 2.5: Time vs error for Ising grids and 3-regular Ising models with local field and interactions sampled 
from a standard normal. Each method in the graph has 10 points, each representing an Ising model of different 
size flO to 100 variables). 


works offer different interpretations and generalizations of survey propagation [46, 177, 199]. Here, 
we propose a generalization based the same notions that extends the application of BP to arbitrary 
commutative semirings. Our derivation closely follows and generalizes the variational approach of 
Mezard and Montanari [208], in the same way that the algebraic approach to BP (using commutative 
semirings) generalizes the variational derivation of sum-product and min-sum BP. 

As a hxed point iteration procedure, if BP has more than one fixed points, it may not converge at 
all. Alternatively, if the messages are initialized properly BP may converge to one of its hxed points. 
SP equations, take “all” BP hxed points into account. In our algebraic perspective, this accounting of 
all hxed points is using a third operation ©. In particular, we require that © also distribute over ©, 
forming a second commutative semiring. We refer to the this new semiring as SP semiring. 

Let p be a BP hxed point - that is 

P.^. = I e di pi^i = Pi^i(pg.^j^.), pi^i = 

and denote the set of all such hxed points by V. Each BP hxed point corresponds to an approximation to 
the q(0), which we denote by Q(p_^ )(0) - using this functional form is to emphasize the dependence of 
this approximation on BP messages. Recall that in the original problem, X is the domain of assignments, 
q(x) is the expanded form and ©-marginalization is (approximately) performed by BP. In the case of 
survey propagation, V is domain of assignments and the integral Q(p_^)(0) evaluates a particular 
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Belief Propagation 

domain: 

X 

Vi Xi 

expanded form: 

q(x) 

integration: 

q(0) = 0xq(®) 

marginalization: 

p(Xi) OC 0^\iP(T) 

factors: 

VI fi(xi) 


Table 2.1: The correspondence between BP and SP 

I Survey Propagation 

Pi^i , pi^i Vi,lG9i 

V 

I Q(P.^.)(9) 

I Q(»)m = ©p^ Q(E^.)(») 

I S(pi^i) « 


Vi, I G (9i 


assignment p_^ to all the messages - i.e., Q(p_^ )(0) is the new expanded form. 

In this algebraic perspective, SP efficiently performs a second integral using © over all fixed points: 

Q(9)(9) = ©- ,„Q(q.^.)(9) (2-48) 

Table 2.1 summarizes this correspondence. 

Our derivation requires (T*,©) to be an Abelian group (i.e., every element of y* has an inverse 
w.r.t. ©). The requirement for invertablity of © is because we need to work with normalized BP and 
SP messages. In section 2.5.4 we introduce another variation of SP that simply counts the BP fixed 
points and relaxes this requirement. 

2.5.1 Decomposition of the integral 

In writing the normalized BP equations in section 2.1, we hid the normalization constant using oc sign. 
Here we explicitly define the normalization constants or local integrals by defining unnormalized 
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messages, based on their normalized version 



del 

0fl(Tl) (0 Pj-^l{Xj) 

X\i jeai\i 


(2.49) 


del 

(0 pj^i(xi) 

JeaAi 


(2.50) 

Pife) 

def 

fi(s)0)Pi^i(a;*) 

iedi 

Pi(Psi^i)fe) 

(2.51) 

Piixi) 

def 

(0pi^i(xi) 

ledi 


(2.52) 


where each update also has a functional form on the r.h.s. In each case, the local integrals are simply 
the integral of unnormalized messages or marginals - e.g., pi^.i(0) = 0^, pi_ 5 >i(xi). 

Define the functional Pi<j>i(pi_ 5 >i, pi-j.^) as the product of messages from i to I and vice versa 

Pi^i(xi) (g) pi^i(xi) Pj^i(pi^i, pi^j)(xi) (2.53) 


Theorem 2.5.1. If the factor-graph has no loops and (3^*,(8)) is an Abelian group, the global integral 
decomposes to local BP integrals as 


q(0) 


®pi(0) (^Pi(0) 

I i 



or in other words q (0) = Q(p ^ )(0) where 


= (8)Pi(£ 3,^,)(«) (8)P.(Pa;^i)(») 

I i 


PjolCPi—Pi—^i)(0) 


I i,ls9* 


(2.54) 


(2.55) 


Proof. For this proof we build a tree around an root node r that is connected to one factor. (Since the 
factor-graph is a tree such a node always exists.) Send BP messages from the leaves, up towards the 
root r and back to the leaves. Here, any message qi_ 5 .i(xi), can give us the integral for the sub-tree that 
contains all the nodes and factors up to node i using qi_,.i(0) = 0^^qj_,.i(xi). Noting that the root is 
connected to exactly one factor, the global integral is 


^^^q(xr) — ^^^(^j^^^qi^r(Tr) — qi^r(0) 


(2.56) 


On the other hand. We have the following relation between qi_^i and Pi_^i (also corresponding 
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factor-to-variable message) 


qi^lixi) = Pi^i(xi) 0 qi^i(0) \/i, I G di 
qi^j(xi) = pi^i(xi) (g) qi^i(0) Vi, I G di 

Substituting this into BP equation (2.2) we get 


(2.57) 

(2.58) 


qi^i(xi) = (^ qj^i(0)pj^i(xj) 

(2.59) 

Je9Ai 


qi^i(xi) = 0fi(s) (^ qj^i(0)pj^i(xj} 

(2.60) 


j£dT\i 


By summing over both l.h.s and r.h.s in equations above and substituting from equation (2.50) we get 


0 <\i^l{Xi) 

Xi ^ 

qi^i(0) = Pi^i(0) (^ qj. 

JeaAi 


qj^i(0) ® 0^ (g) pj^i( 


JeaAi 


(2.61) 


and similarly for equation (2.60) using integration and substitution from equation (2.49) we have 


0 qi^i(xi) 

Xi 

qi^i(0) 


qj_,i(0) ® 0fi(s) (S) 


Pi^i(0) (^ qi^i(0) 

jedl\i 


ri ie9i\i 


(2.62) 


Equation (2.61) and 2.61 are simply recursive integration on a tree, where the integral up to node 
i (i.e., qj_>i(0) in equation (2.61)) is reduced to integral in its sub-trees. By unrolling this recursion we 
see that qj_^i(0) is simply the product of all pi_>.j(0) and pi_>.j(0) in its sub-tree, where the messages 
are towards the root. Equation (2.56) tells us that the global integral is not different. Therefore, 
equation (2.61) we can completely expand the recursion for the global integral. For this, let f i restrict 
the di to the factor that is higher than variable i in the tree (i.e., closer to the root r). Similarly let f I 
be the variable that is closer to the root than I. We can write the global integral as 


q(0) = (g).^j^^Pi^i(0)(g)j_^^^jPi^i(0) 


(2.63) 


Proposition 2.5.2 shows that these local integrals can be written in terms of local integrals of interest 
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- i.e.. 


Pi- 


and Pi-^i(0) = ^ 


Biol 


Bioc 


Substituting from the equations above into equation (2.63) we get the equations of theorem 2.5.1. □ 

Proof, (proposition 2.5.2) By definition of pi(xj) and Pi_ 5 .i(Ti) in equation (2.49) 


pi(a:i) = pioi(Ti) (g) Pioi(2;i) pi(xi) = pioi(a;i) <8) Pioi(Ti) 

Pi(0) = Pioi(0) ® Pioi(xi) (g) Pioi(Ti)^ Pi(0) = Pioi(0) ® Pi^i(0) 

where in the last step we used equation (2.53). 

Similarly for the second statement of the proposition we have 

Pi(xi) = Pioi(Ti) (g) pioi(a:i) ^ pi(xi) = ^ Pioi(Ti) ® Pioi(a;i) 

Xi Xi 

Pi(0) = Pioi(0) ® (0 Pioi(Ti) (g) Pioi(2;i)) ^ Pi(0) = Pioi(0) ® Pi^i(0) 

Xi 


—)• 


—)■ 


□ 


2.5.2 The new factor-graph and semiring 

The decomposition of integral in theorem 2.5.1 means Q(p_^ )(0) has a factored form. Therefore, a 
factor-graph with p^ as the set of variables and three different types of factors corresponding to 
different terms in the decomposition - he., Pi(Pgj_^j)(0), Bi(Pgj\j^j)(0) Pj<j.i(pi_>i, pi_>.j)(0)“^ can 
represent Q(p_^ )(0). 

Figure 2.6 shows a simple factor-graph and the corresponding SP factor-graph. The new factor-graph 
has one variable per each message in the original factor-graph and three types of factors as discussed 
above. Survey propagation is simply belief propagation applied to the this new factor-graph using the 
new semiring. As before BP messages are exchanged between variables and factors. But here, we can 
simplify BP messages by substitution and only keep two types of factor-to-factor messages. We use Si_^i 
and to denote these two types of SP messages. These messages are exchanged between two types 
of factors, namely Pi(Pgj_^j)(0) and Pi(Pg^\^j^-)(0)- Since the third type of factors Pio-i(Pi- 5 -i 5 Pi->-i)(0)~^ 
are always connected to only two variables, Pi^i and pi^i, we can simplify their role in the SP message 
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Figure 2.6: Part of a factor-graph (left) and the corresponding SP factor-graph on the right. The variables in 
SP factor-graph are the messages in the original graph. The SP factor-graph has three type of factors: (7jPi(.)(0), 

(II) Pi(.)(0) and (III)Pi^i{.)il)) . As the arrows suggest, SP message updates are simplified so that only two 
type of messages are exchanged: and Si^i between factors of type (I) and (II). 


update to get 


Si^l(pi^l, pi^i) oc ^ 
Si^i(pi^l, pi^i) oc ^ 


/ Pl(PaT^T)(0) 


—^^(p^—^J? PJ— 


\p,^i,Pi^i VP»«i(p 


.Sj—>i(pj—Pi— 


(2.64) 

(2.65) 


where in all cases we are assuming the messages p G P are consistent with each other - i.e., satisfy BP 
equations on the original factor-graph. Note that, here again we are using the normalized BP message 
update and the normalization factor is hidden using oc sign. This is possible because we assumed (8> has 
an inverse. We can further simplify this update using the following proposition. 

Proposition 2.5.2. /orp_^ G "P 



( 2 . 66 ) 

(2.67) 
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Proof. By definition of Pi(tj) and Pi_^i(xi) in equation (2.49) 


Pl((/)j = pi_ 


Pl(Xj) = Pl^j(Xi) (g) Pi^l(Xi) 
0 Pl^i(Xi) (g) Pi^i(x 

Xi 


® Pl(3;i) = Pl^j(Xi) (g) Pi^l(Xi) 

Xi Xi 

Pi(0) = Pi^*(0) (g) Pi^i(0) 


where in the last step we used equation (2.53). 

Similarly for the second statement of the proposition we have 


Pi(Xj) = p^^l{Xi) (g) pi^i(xi) 

Pi(0) = Pi^i(0) ® (0 Pi^iixi) (g) Pi^i(xi)) ^ 

Xi 


^ Pi(a:^i) = ^ Pi^i(xi) (g) pi^i(xi) 

Xi Xi 

Pi(0) = pi^i(0)(g)p.^j(0) 


—)■ 


□ 


The term on the l.h.s. in the proposition above appear in equation (2.64) and the terms on the 
r.h.s are local message integrals given by equation (2.49). We can enforce p_^ € V, hy enforcing BP 
updates Pi_i.i = Pi_^i(pg^^j^^) and pi_>.j = Pi_>.i(pgj^_^j) “locally”, during the message updates in the 
new factor-graph. Combining this constraint with the simplification offered by proposition 2.5.2 gives 
us the SP message updates 

S,^i(pi^i)cx 0 (^l(pi^i = Pi^i(pg.^j^.)) (g) Pi^i(Pg.^j^.)(0) (2-68) 

Si^i(pi^i) cx 0 G(pi^* = Pi^i(Pgj^.^j)) (g) Pi_^,(pgj,^^^j)(0) (2.69) 


® © 

where 1(.) is the identity function on the SP semiring, where 1(true) = 1 and I(false) = 1. 

Here each SP message is a functional over all possible BP messages between the same variable and 
factor. However, in updating the SP messages, the identity functions ensure that only the messages 
that locally satisfy BP equations are taken into account. Another difference from the updates of equa¬ 
tion (2.64) is that SP messages have a single argument. This is because the new local integrals either 
depend on or and not both. 

Example 2.5.1. In variational approach, survey propagation comes in two variations: entropic SP(^) 
and energetic SP(y) [208]. For the readers familiar with variational derivation of SP, here we express 
the relation to the algebraic approach. According to the variational view, the partition function of the 
entropic SP is g?iog(Q{p._>.)(0))^ where Q(p )(0) is the partition function for the sum-product 

semiring. The entropic SP has an inverse temperature parameter, a.k.a. Parisi parameter, G M. It 
is easy to see that ^ = 1 corresponds to © = and © = x in our algebraic approach. The 
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limits of ^ oo corresponds to © = max. On the other hand, the limit of ^ 0 amounts to ignoring 

Q(p_^ )(0) and corresponds to the counting SP; see section 2.5.4. 

The energetic SP(y) is different only in the sense that Q(p )(0) in g-yiog(Q(£.^.)(0)) jg 

ground state energy. This corresponds to © = +,© = max and © = X]) the limits of the inverse 
temperature parameter y —)• oo is equivalent to © = min, © = min and © = By taking an algebraic 
view we can choose between both operations and domains. For instance, an implication of algebraic 
view is that all the variations of SP can be applied to the domain of complex numbers T* = C. 

2.5.3 The new integral and marginals 

Once again we can use theorem 2.5.1, this time to approximate the SP integral Q(0)(0) = 0- Q(p )(0) 

il.— 

using local integral of SP messages. 

The SP marginal over each BP message Pi_i.i or pi_>.j is the same as the corresponding SP message 
- i.e., S(pi_ 5 .i) = Sj_>i(pi_ 5 .i). To see this in the factor-graph of figure 2.6, note that each message 
variable is connected to two factors, and both of these factors are already contained in calculating one 
SP messages. 

Moreover, from the SP marginals over messages we can recover the SP marginals over BP marginals 
which we denote by S(p)(xj). For this, we simply need to enumerate all combinations of BP messages 
that produce a particular marginal 

S(p)(xi) oc 0^ l(p(Ti) = P(Pg.^.)(xi))0^^^.Si^i(pi^i) (2.70) 

2.5.4 Counting survey propagation 

Previously we required the © operator to have an inverse, so that we can decompose the BP integral q(0) 
into local integrals. Moreover, for a consistent decomposition of the BP integral, SP and BP semiring 
previously shared the © operation.® 

Here, we lift these requirements by discarding the BP integrals altogether. This means SP semiring 
could be completely distinct from BP semiring and (T*, ©) does not have to be an Abelian group. This 
setting is particularly interesting when the SP semiring is sum-product over real domain 

Si^i(Pi^i) oc n Sj^i(pj^i) (2.71) 

Si->.j(pi->.i) OC l(pi_>.j = Pi_>.i(pQj^^^j)) Sj_^i(pj_^i) (2.72) 

®This is because if the expansion operation (g) was different from the expansion operation of BP, ig), the expanded form 
Q(p ) in the SP factor-graph would not evaluate the integral q(0) in the BP factor-graph, even in factor-graphs without 
any loops. 
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Here, the resulting SP integral Q(p ) = 1(P ^ simply “counts” the number of BP fixed 

points and SP marginals over BP marginals (given by equation (2.70)) approximates the frequency of 
a particular marginal. The original survey propagation equations in [48], that are very successful in 
solving satisfiability correspond to counting SP applied to the or-and semiring. 

Example 2.5.2. Interestingly, in all min-max problems with discrete domains X, min-max BP messages 
can only take the values that are in the range of factors - i.e., T* = T- This is because any ordered set 
is closed under min and max operations. Here, each counting SP message Sj_>i(pi_ 5 .i) : —)> M is a 

discrete distribution over all possible min-max BP messages. This means counting survey propagation 
where the BP semring is min-max is computationally “tractable”. In contrast (counting) SP, when 
applied to sum-product BP over real domains is not tractable. This is because in this case each SP 
message is a distribution over a uncountable set: Sj_^i(pj_ 5 .i) : —)> M. 

In practice, (counting) SP is only interesting if it remains tractable. The most well-known case 
corresponds to counting SP when applied to the or-and semiring. In this case the factors are constraints 
and the domain of SP messages is {true, false}I'^'L Our algebraic perspective extends this set of 
tractable instances. For example, it show that counting SP can be used to count the number of fixed 
points of BP when applied to xor-and or min-max semiring. 

2.6 Messages and particles 

The contrasting properties of stochastic and deterministic approximations make a general hybrid method 
desirable. After reviewing the basics of MCMC in section 2.6.1, we discuss some particle-based ap¬ 
proaches to message passing and introduce our hybrid inference method that combines message passing 
and Gibbs sampling in section 2.6.3. The discussions of this section are limited to sum-product inference. 

2.6.1 Markov Chain Monte Carlo 

Markov Chain Monte Carlo (MCMC) is a technique to produce samples from a target distribution p, 
by exploring a Markov Chain which is constructed such that more probable areas are visited more often 
[11, 227, 263]. 

A Markov Chain is a stochastic process , • • •, in which: 

p(xW I = kt(xW,x(*-i)) (2.73) 

that is the current state is independent of all the history, given only the previous state For 

a homogeneous Markov chain, the transition kernel k4(xW,x(* 1)) is the same for all t. In this case 
and under some assumptions^, starting from any arbitrary distribution x^^^ ~ after at least Tmix 

® The assumptions are: (I) Irreducibility: There is a non-zero probability of reaching all states starting with any 
arbitrary state and (II) Aperiodicity: The chain does not trap in cycles. 
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transitions by the chain, we have ~ p(t). Given a set of particles x[l],..., x[L] sampled from 

p(x), we can estimate the marginal probabilities (or any other expectation) as 

1 ^ 

p(xi) oc - ^ l(x[n]i = Xj) (2.74) 

n=l 

For a given transition kernel, the following condition, known as detailed balance, identifies the 
stationary distribution p: 

p(x^'^^)k(x*'*\x^*“^^) = p(;r^*“^^)k(x*'*“^\a:^*^) 

p(x^*^) = ^ p(t) k(x^*^; x) (2.75) 

X 

which means that p(.) is the left eigenvector of k(.,.) with eigenvalue 1. All the other eigenvalues are 
less than one and the mixing time, Tmix, of the chain depends on the second largest eigenvalue; the 
smaller it is, the faster consecutive transition by k(.,.) shrinks the corresponding components, retaining 
only p. 

Metropolis-Hasting Algorithm and Gibbs sampling 

Many important MCMC algorithms can be interpreted as a special case of Metropolis-Hasting (MH) 
[124, 207]. Similar to importance sampling, MH uses proposal distribution m(x^*) | x^'^”^)), but in this 
case, the proposal distribution is to help with the design of transition kernel k(.,.). After sampling x^*) 
from the proposal ~ ti(x), it is accepted with probability 

p(x(*))/p(x(*“^)) 

’ m(xh) I xh“^))/m(xh“^) I xh)) 

where, if the proposed sample is not accepted, 

The kernel resulting from this procedure admits the detailed balance condition w.r.t. the stationary 
distribution p(x). An important feature of MCMC, which allows for its application in graphical models, 
is the possibility of building valid transition kernels as the mixtures and cycles of other transition 
kernels. If p, is the stationary distribution for ki and k 2 , then it is also the stationary distribution for 
kik 2 (cycle) and Aki -|- (1 — A)k 2 , 0<A<1 (mixture) [295]. 

Cycling of kernels gives us Gibbs sampling in graphical models [109], when kernels are 

ki(xf\x^7^^) = p(xf^ 1 x5^7^^) Vi (2.77) 

where as before Ai is the Markov blanket of node i. It is also possible to use block MH-kernels 
with graphical models. In MH-samplers, when highly correlated variables are blocked together, mixing 


(2.76) 


min < 1 




68 


CHAPTER 2. APPROXIMATE INFERENCE 


properties improve. In fact, the Gibbs sampler is such a method, with the proposal distribution that 
results in acceptance with probability 1. 

Similar to the general MH-methods, Gibbs sampling can fail if the kernel does not mix properly. 
This could happen if variables are strongly correlated. In principle one can assemble neighbouring 
variables into blocks and update them as one [147]. However in difficult regimes the number of variables 
that should be flipped to move from one local optima to another, is in the order of total number of 
variables [208], which makes this approach intractable. 

Mixture of kernels can be used to combine a global proposal with a local proposal (e.g., [81, 88]). In 
fact if we could view a message passing operator as a transition kernel (at least when message passing 
is exact), then the mixture of kernels - e.g., with Gibbs sampling - could produce interesting hybrid 
methods. In section 2.6.3, by combining BP and Gibbs sampling operator (when rephrased as message 
update) we introduce a new hybrid method. 

2.6.2 Hybrid methods 

Stochastic methods are slow in convergence but they are guaranteed to converge. Even if the kernel 
is reducible, samples will cover a subset of the true support - i.e., MCMC still converges to a single 
sub-measure when the Gibbs measure is not unique. 

On the other hand, deterministic approximations are fast but non-convergent in difficult regimes. 
Modifications that result in convergence are either generally intractable {e.g., SP), slow {e.g., loop 
corrections and the methods that tighten a bound over the free energy) and/or degrade the quality of 
solutions. 

Moreover, sampling methods are flexible in representing distributions. This has motivated growing 
interest in nonparametric approach to variational inference [110] and in particular variations of Belief 
Propagation [141, 142, 144, 172, 231, 282, 283]. However, in the sense that these methods do not rely 
on a Markov chain for inference, they are closer to variational inference than MCMC methods. 

To better appreciate this distinction, consider two closely related methods: Gibbs Sampling and 
hard-spin mean-field [10], that uses the following update equation 

p{xi) oc Efi n 

5\i l&di j£l\i 

Interestingly, the detailed balance condition of equation (2.75) for Gibbs sampler gives us the same 
equation: 


p(xi) oc Efi PiXi I n 

^\i l&di j£l\i 


However, given enough iterations, Gibbs Sampling can be much more accurate than hard-spin mean 
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field method. Here the difference is that, with Gibbs sampler, this equation is enforced by the chain 
rather than explicit averaging of distributions, which means the correlation information is better taken 
into account. 

2.6.3 Perturbed Belief Propagation 

Consider a single particle x = x[l] in Gibbs Sampling. At any time-step t, Xi is updated according to 

xf^ ~ p(xi) oc n 
ledi 

Here we establish a correspondence between a particle in Gibbs Sampling and a set of variable-to- 
factor messages in BP -i.e., x {pj-s.i(.)}i,/e9n by defining all the messages leaving variable Xi as a 
delta-function 


Pi^i(xi) = l{xi = Xi) = Gi^i(p^.^g.){xi) VI G di (2.79) 

where Gj_^i(pAj_,.aj) is the Gibbs sampling operator that defines variable-to-factor message Pi_>.i as a 
function of all the messages from Markov blanket of i (Ai) to its adjacent factors (di). To completely 
define this random operator, note that Xi is a sample from the conditional distribution of Gibbs sampling 

Xi ~ p(xj) oc fi(a:i,T9i\i) 

j£di 

OC n Pj^iixj) (2.80) 

leai ^\i j&dl\i 


Combining the operators 

Now, lets write the BP updates for sum-product semiring once again; by substituting the factor-to- 
variable messages (equation (2.6)) into variable-to-factor messages and the marginals (equations (2.7) 
and (2.9)) we get 

Pi^i(xi) oc n E fj(Tj) OC Pj_>l(p^^_^g^)(Xi) 

p(Xi) OC HE h{xi) pj^iixj) 

l&di j£dl\i 

where, similar to equation (2.7), Pi_>i(.) denotes the message update operator, with the distinction that 
here, the arguments are also variable-to-factor messages (rather than factor-to-variable messages). 

By this rewriting of BP updates, the BP marginals equation (2.82) are identical in form to the Gibbs 
sampling distribution of equation (2.80). This similar form allows us to combine the operators linearly 


(2.81) 

(2.82) 
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to get perturbed BP operator: 

^ “ 7 )Pi^i(PAi^ai) Vi, I G di (2.83) 

The Perturbed BP operator Xj_ 5 .i(.) updates each message by calculating the outgoing message 
according to BP and GS operators and linearly combines them to get the final massage. During T 
iterations of Perturbed BP, the parameter 7 is gradually and linearly changed from 0 towards 1.^*^ 
Algorithm 2 summarizes this procedure. Note that the updates of perturbed BP are compatible with 
variable synchronous message update (see section 2 . 1 . 1 ). 

input : a factor graph, number of iterations T. 

output: a sample x. 

Initialize messages 
7-^0 

repeat 

for each variable Xi do 

calculate p(xj) using equation (2.80) 

calculate BP messages Pi_>./(.) using equation (2.81) VI G di 
sample Xi p(t*) 

combine BP and Gibbs sampling messages: 

Pi^i(a:i) ^ 7 Pi^i(a:j) + (1 - 7 ) l(xj = x*) (2.84) 

end 

7 ^ 7+ 

until T iterations 

return x 

Algorithm 2: Perturbed Belief Propagation 


Experiments 

Perturbed BP is most successful in solving CSPs; see chapter 3. However we can also use it for 
marginalization by sampling (equation (2.74)). Here we use the spin-glass Ising model on 8 x 8 grids 
and Erdos-Reny (ER) random graphs with N = 50 and 150 edges. We sampled local helds independently 
from A/'(0,1) and interactions from M{0, 0^), where we change 9 to control the problem difficulty - higher 
values correspond to more difficult inference problems. We then compared the average of the logarithm 

'^^Perturbed BP updates can also be used with any fixed 7 € [0,1]. 
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(base 10) of mean (over N variables) marginal error 




(2.85) 




3.0 -2.5 -2.0 -1.5 -1.0 -0.5 O.C 




0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 


(a) Ising grid 


(b) Random ER graph 


Figure 2.7: Log mean marginal error (x and y axes) comparison between Perturbed BP, BP and Gibbs sampling 
for (left) 8x8 periodic Ising grid; (right) random graph with 50 variables, 150 edges and spin-glass interactions. 
The size of each circle is proportional to the difficulty of that problem. 


Figure 2.7 compares perturbed BP, Gibbs sampling and BP where larger circles correspond to 
more difficult instances 9 G {0.5,1, 2,4, 8}. All methods are given a maximum of 10,000 iterations. 
Perturbed BP and Gibbs sampling use T = 100 iterations to obtain each sample, while BP is ran once 
until convergence or until the maximum number of iterations is reached. 

The results shows that Perturbed BP as a sampling method is generally better than Gibbs Sampling. 
Also, for some cases in which BP’s result is very close to random (i.e., ~ —.3 in log marginal error). 
Perturbed BP produces relatively better results. Note that for difficult instances, increasing T even 
by 100 folds does not significantly improve the results for either Gibbs sampling or perturbed BP. For 
Gibbs sampling, this can be explained by formation of pure states that result in exponential mixing 
time [192, 208]. 

2.6.4 Perturbed survey propagation 

When the SP operators are sum-product - i.e., © = sum and © = prod - we can apply a perturbation 
scheme similar to perturbed BP. 

Recall that sum-product SP defines a distribution over BP fixed point. Therefore sampling from 
this distribution amounts to randomly selecting a single BP fixed point. This corresponds to sampling 
a single message Pi_^i[l] ~ Sj_ 5 .i(pj^i)^^ and bias the SP message Sj_s.i(.) towards this random choice - 

^^Recall that the SP marginal over each message Pi^i is identical to the corresponding message Si->i(pi^.i). 
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i.e., (analogous to equation (2.84) in algorithm 2) 

Si^i(pi^i) ^ 7 Si^i(pi^i) + (1-7)l(pi^i = Pi^i[l]) (2.86) 

where Pi^i[l] ~ Sj^i(pi^i) 

An alternative form of perturbation is to perturb SP messages using implicit SP marginals. Recall 
that in using counting SP, the SP marginals over BP marginals (S(p)(xj); see equation (2.70)) are simply 
the frequency of observing a particular marginal in BP fixed points. This implicitly defines SP marginal 
over the original domains A*; Vi, which we denote by S{xi) 

S{xi) oc ^ S(p)(xi) (2.87) 

p 

After obtaining a sample Xi ~ S(xj), we bias all the outgoing SP messages accordingly 

Si^i(pi^i) ^ 7 Si^i(pi^i) + (1- 7 ) l(pi^i(.) = l(fi,.)) VI G 9i (2.88) 

where Xi ~ Si{xi) 

where, similar to perturbed BP, 7 is gradually increased from 0 to 1 during T iterations of Perturbed 
SP. We use this form of perturbation in section 3.3 to obtain a satisfying assignment x, to CSPs. We 
show that although computationally more expensive than perturbed BP, this method often outperforms 
all the other well-known methods in solving random CSPs. 




Part II 


Combinatorial problems 
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Message Passing algorithms of different semirings are able to solve a variety of combinatorial prob¬ 
lems: (I) To solve constraint satisfaction problems (CSPs) the sum-product message passing is 
often used, where p(^) defines a uniform distribution over solutions and the objective is to produce 
a single assignment x* s.t. p(x*) > 0. (II) Here the estimates of the partition function, either using 
the approximation given by the Bethe free energy (section 2.3.1) or the decomposition of integral in 
section 2.5.1, is used for approximate counting of the number of solutions. This estimate to the 
partition function is also used for integration problems such approximating the permanent of a matrix 
(see chapter 5). (Ill) The min-sum semiring is often used for (constrained) optimization and we 
formulate (IV) bottleneck problems as min-max inference. 

This part of the thesis studies the message passing solutions to combinatorial problems under three 
broad categories. (1) Chapter 3 studies the constraint satisfaction problems, where we use perturbed 
message passing (section 2.6.3 and 2.6.4) to produce state-of-the-art results in solving random instances 
of satisfiability and coloring problems in section 3.3. This chapter then studies several other NP-hard 
problems including set-cover, independent set, max-clique, clique-cover and packing for construction of 
non-linear codes. (2) Chapter 4 studies variations of clustering problems including k-median, k-center, 
k-clustering, hierarchical clustering and modularity optimization. (3) In chapter 5 we study problems 
that involve enumeration, constraint satisfaction or constrained optimization over permutations. This 
includes (bottleneck) travelling salesman problem, matching, graph alignment, graph isomorphism and 
finding symmetries. 

Note that this classification of combinatorial problems into three categories is superficial and is made 
solely to provide some organization. In several places we violate this categorization in favour of better 
flow. For example we study some constraint satisfaction problems such as (sub)-graph isomorphism and 
Hamiltonian cycle in chapter 5 rather than chapter 3. We investigate the “optimization” counterpart 
of some CSPs in chapter 3 and review message passing solutions to finding trees rather than clusters in 
chapter 4. Moreover, many of the graphical models presented here are proposed by other researchers and 
they are included here only for completeness. As a final remark, we note that many of the statements 
in the following are assuming P ^ NP. 



Chapter 3 

Constraint satisfaction 


We saw in section 1.3 that “any” semiring can formulate Constraint Satisfaction Problems (CSPs). In 
particular, as we saw in section 1.3, several semirings are isomorphic to the and-or ({false, true}, V, A) 
semiring and therefore result in equivalent BP procedures. The BP message update over the and-or 
semiring is called warning propagation (WP). WP marginals indicate whether or not a particular 
assignment to each variable is allowed, and therefore indicate a cluster of solutions. However, the success 
of warning propagation highly depends on initialization of messages. In contrast, if convergent, the fixed 
points of BP on the sum-product semiring x) are less dependent on initialization. 

Example 3.0.1. iP-coloring: Given a graph Q = (V,T), the K-coloring (K-COL) problem asks 
whether it is possible to assign one color (out of K) to each node s.t. no two adjacent nodes have the 
same color. Here, Xj G T} = {l,...,g} is a K-ary variable for each i G A/", and we have M = \£\ 
constraints; each constraint fij(xj,Xj) = l(xi / xj) depends only on two variables and is satisfied iff 
the two variables have different values. Here the identity function l(xj 7 ^ Xj) depends on the semiring 
(see section 1.3). 


{True, True, True} 



Figure 3.1: (a) The set of all possible assignments to 3 variables. The solutions to the 3-SAT problem of 
equation (3.1) are in white circles, (b) The factor-graph corresponding to the 3-SAT problem of equation (3.1). 
Here each factor prohibits a single assignment. 
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Example 3.0.2. iC-Satisfiability (K-SAT): Given a conjunction of disjunctions with K literals, 
K-satisfiability seeks an assignment that evaluates to true. Here, all variables are binary (Aj = 
{true, false}) and each clause (factor fj) depends on K =\ dl \ variables. A clause evaluates to 
zero only for a single assignment out of 2^ possible assignment of variables [107]. 

Consider the following (or-and semiring formulation of) 3-SAT problem over 3 variables with 5 
clauses: 


<l(2i) = {^Xi V -iXj V Xk) A (-iXj V Xj V Xk) A (xj V ^Xj V Xk) A (3-1) 

'-V-" '-V-" '-V-" 

fi fj Ik 

(-■Xi V Xj V -iXfc) A (xj V -iXj V -iXfc) 

'-V-" '-v-" 

II Ih 


The factor fj corresponding to the first clause takes the value 1 (for or-and semiring this corresponds 

A e V 

to 1 = true), except for Xj = (true, true, false), in which case it is equal to 1 (1 = false). Figure 3.1 

shows this factor-graph and its set of solutions: 

5 = { (true, true, true) , (false, FALSE, FALSE), (FALSE, FALSE, TRUE) } . 

When using sum-product semiring, where fi(x) G Ti = {0,1}, p(x) (equation (1.17)) defines a 
uniform distribution over the set of solutions and the partition function q(0) counts the number of 
solutions. The challenge is then to sample from this distribution (or estimate q(0)). The common 
approach to sample from p(x) is to use decimation (see equation (1.22)). Here one repeatedly applies 
sum-product BP to estimate marginals p(xi). Then one fixes a subset of variables x^ according to 
their marginals. For this one may sample x* ~ p(xi) or select Xj with maximum marginal x* = 
arg^,, maxp(xj). Sum-product BP is then applied to the reduced factor-graph in which fi(xi) is replaced 
by fi(^i)l(3iinA = iiiinA)- Tbis process, called sum-product BP-guided-decimation (BP-dec), is 
repeated to obtain a complete joint assignment x*. 

However, using BP-guided-decimation is solving a more difficult problem of marginalization. In fact, 
in equation (1.22) we showed how using decimation one may estimate the partition function (which is 
a problem). This suggests that decimation may not be the most efficient approach to solving NP- 
complete CSPs. Here, instead we consider using the perturbed belief propagation (section 2.6.3) to 
sample from the set of solutions, where the semiring used by perturbed BP is the same as sum-product 
BP-dec. 

To better understand warning propagation, sum-product BP-dec, and perturbed BP, when applied 
to CSPs, consider the following examples. 

Example 3.0.3. Here we apply three different message passing methods to solve the simple 3-SAT 
example of figure 3.1. 
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(I) Warning Propagation: 

We use the max-product semiring ({0,1}, max, prod) version of warning propagation for this example. 
As figure 3.1 suggests, the set of solutions S clusters into two subsets 

{{true, TRUE, true}} and {{false, false, false}, {false, false, true}}. Here, each of the clusters 
is a fixed point for WP - e.g., the cluster with two solutions corresponds to the following fixed point 


Pi_5.A(TRUE) = p(xi = true) = 0 


Pi_^A (false) = p(xj = false) = 1 

VA G di 

Pj^.A(TRUE) = p(Xj = true) = 0 


Pj_s.A(FALSE) = p{xj = false) = 1 

VA G dj 

Pfc^A(TRUE) = p(xfc = true) = 1 


Pfc^A (false) = p(xfc = false) = 1 

VA G dk 


where the messages indicate the allowed assignments within this particular cluster of solutions. De¬ 
pending on the initialization, WP messages may converge to any of its fixed points that also include 
the trivial cluster, where all (alternatively none) of the assignments are allowed. 

(II) BP-dec: 

Applying BP to this 3-SAT problem (starting from uniform messages) takes 20 iterations to converge - 
i.e., for the maximum change in the marginals to be below e = 10“®. Here the message, pi_^j(xi), from 
fi to Xi is: 


Pl^l(Xi) OC ^fl(xi^2,3) 

Similarly, the message in the opposite direction, pfc_^i(xi) is 

Pi^l(Xi) OC pj^i(Xi) PK^i(Xi) PL^i(Xi) PH^i(Xi) 

Here BP gives us the following approximate marginals: p(xi = true) = p(xj = true) = .319 
and p(xfc = true) = .522. From the set of solutions, we know that the correct marginals are p(xj = 
true) = p(xj = true) = 1/3 and p(xfc = true) = 2/3. The error of BP is caused by influential loops 
in the factor-graph of figure 3.1(b). Here the error is rather small; it can be arbitrarily large in some 
instances; sometimes it ca prevent converging at all. 

By fixing the value of Xi to false, the SAT problem of equation (3.1) collapses to: 


SAT{x^j ^ I Xi = false) = (-iXj V Xk) A (-iXj V ^Xk) 


( 3 . 2 ) 
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BP-dec applies BP again to this reduced problem, which give p(xj = true) = .14 (note here that 
p(xj = true) = 0) and p(xfc = true) = 1/2. By fixing Xj to false, another round of decimation yields 
a solution ^ = {false, false, true}. 

(Ill) Perturbed Belief Propagation: 

Perturbed BP can find a solution in T = 4 iterations (see algorithm 2). Our implementation shuffles 
the order of updates for variables in each iteration. 

In the first iteration, 7 = 0, which means updates are the same as that of sum-product BP. In 
the second iteration, the order of updates is Xj, x^, Xj and 7 = 1/3. At the end of this iteration 
p^*^^^(xj = true) = .38. Perturbed BP then samples Xj = false from this marginal. This sample 
influences the outgoing message according to the perturbed BP update equation (2.84), which in turn 
influences the beliefs for x, and Xk- At the end of this iteration (x, = true) = .20 and (x^ = 
true) = .53. At the final iteration 7 = 1 and the order of updates is Xi,Xj and x^. At this point 
p(t=3)(x^ = true) = .07 and the sample x* = false. This means the outgoing message is deterministic 
(i.e., pj_5.A(FALSE) = 1 and pj_5.A(TRUE) = 0, for all A G di). This choice propagates to select Xj = 
FALSE. Finally = true) = p(*^^)(xfc = true) = .5, which correctly shows that both choices 

for Xk produce a solution. 

To compare the performance of sum-product BP-dec and Perturbed BP on general CSPs, we con¬ 
sidered all CSP instances from XCSP repository [189, 270], that do not include global constraints or 
complex domains. All instances with intensive constraints (z.e., functional form) were converted into 
extensive format for explicit representation using dense factors. We further removed instances contain¬ 
ing constraints with more that 10® enteries in their tabular form. We also discarded instances that 
collectively had more than 10® enteries in the dense tabular form of their constraints. 

Figure 3.2(a,b) compares the time and iterations of BP-dec and Perturbed BP for successful attempts 
where both methods satisfied an instance. ^ 

Overall Perturbed BP, with 284 solved instances, is more successful than BP-dec with 253 successful 
runs. On the other hand, the average number of iterations for successful instances of BP-dec is 41, 284, 
compared to 133 iterations for Perturbed BP. This makes Perturbed BP 300 times more efficient than 
BP-dec. ® 

^ Since our implementation represents all factors in a dense tabular form, we had to remove many instances because of 
their large factor size. We anticipate that Perturbed BP and BP-dec could probably solve many of these instances using 
a sparse representation of factors. 

^ We used a convergence threshold of e = .001 for BP and terminated if the threshold was not reached after T = 
10 X = 10, 240 iterations. To perform decimation, we sort the variables according to their bias and fix p fraction of 
the most biased variables in each iteration of decimation. This fraction, p, was initially set to 100%, and it was divided 
by 2 each time BP-dec failed on the same instance. BP-dec was repeatedly applied using the reduced p, at most 10 times, 
unless a solution was reached - i.e., p = .1% at final attempt. 

For Perturbed BP, T = 10 at the starting attempt, which was increased by a factor of 2 in case of failure. This was 
repeated at most 10 times which means Perturbed BP used T = 10, 240 at its final attempt. Note that Perturbed BP at 
most uses the same number of iterations as the maximum iterations per single iteration of decimation in BP-dec. 

® We also ran BP-dec on all the benchmarks with maximum number of iterations set to T = 1000 and T = 100 






3.1. PHASE TRANSITIONS IN RANDOM CSPS 


79 




Figure 3.2: Comparison of number of iterations (left) and time (right) used by BP-dec and Perturbed BP in 
benchmark instances where both methods found satisfying assignments. 

3.1 Phase transitions in random CSPs 

Random CSP (rCSP) instances have been extensively used in order to study the properties of combi¬ 
natorial problems [1, 103, 179, 215] as well as in analysis and design of algorithms [213, 277]. 

Studies of rCSP, as a critical phenomena, focus on the geometry of the solution space as a function 
of the problem’s difficulty, where rigorous [2, 70] and non-rigorous [209, 210] analyses have confirmed 
the same geometric picture. 

When working with large random instances, a scalar a associated with a problem instance, a.k.a. con¬ 
trol parameter - e.g., the clause to variable ratio in SAT- can characterize that instance’s difficulty 
(i.e., larger control parameter corresponds to a more difficult instance) and in many situations it char¬ 
acterizes a sharp transition from satisfiability to unsatisfiability [59]. 

Example 3.1.1. Random AT-satisfiability Random AT-SAT instance with N variables and M = aN 
constraints are generated by selecting K variables at random for each constraint. Each constraint is set 
to zero (i.e., unsatisfied) for a single random assignment (out of 2^). Here a is the control parameter. 

Example 3.1.2. Random A'-coloring The control parameter for a random AT-COL instances with 
N variables and M constraints is its average degree a = We consider Erdos-Reny random graphs 
and generate a random instance by sequentially selecting two distinct variables out of N at random to 
generate each of M edges. For large N, this is equivalent to selecting each possible factor with a fixed 
probability, which means the nodes have Poisson degree distribution Pr(|5i| = d) (x 

iterations. This reduced the number of satisfied instances to 249 for T = 1000 and 247 for T = 100, but also reduced the 
average number of iterations to 1570 and 562 respectively, which are still several folds more expensive than Perturbed BP. 
see Appendix xyz for more details on these results. 
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While there are tight bounds for some problems [3], finding the exact location of this transition 
for different CSPs is still an open problem. Besides transition to unsatisfiability, these analyses have 
revealed several other (phase) transitions [179]. Figure 3.3(a)-(c) shows how the geometry of the set of 
solutions changes by increasing the control parameter. 

Here we enumerate various phases of the problem for increasing values of the control parameter: (a) 
In the so-called Replica Symmetric (RS) phase, the symmetries of the set of solutions (a.k.a. ground 
states) reflect the trivial symmetries of problem wrt variable domains. For example, for iC-COL the set 
of solutions is symmetric wrt swapping all red and blue assignment. In this regime, the set of solutions 
form a giant cluster (i.e., a set of neighboring solutions), where two solutions are considered neighbors 
when their Hamming distance is one [2] (or non-divergent with number of variables [210]. Local search 
methods {e.g., [277]) and BP-dec can often efficiently solve random CSPs that belong to this phase. 



Figure 3.3: A 2-dimensional schematic view of how the set of solutions of CSP varies as we increase the control 
parameter a from (left) replica symmetric phase to (middle) clustering phase to (right) condensation phase. Here 
small circles represent solutions and the bigger circles represent clusters of solutions. Note that this view is very 
simplistic in many ways - e.g., the total number of solutions and the size of clusters should generally decrease 
from left to right. 

(b) In clustering or dynamical transition (IdRSB^), the set of solutions decomposes into an 
exponential number of distant clusters. Here two clusters are distant if the Hamming distance between 
their respective members is divergent {e.g., linear) in the number of variables, (c) In the condensation 
phase transition (IsRSB'^), the set of solutions condenses into a few dominant clusters. Dominant 
clusters have roughly the same number of solutions and they collectively contain almost all of the 
solutions. While SP can be used even within the condensation phase, BP usually fails to converge 
in this regime. However each cluster of solutions in the clustering and condensation phase is a valid 
fixed-point of BP. (d) A rigidity transition (not included in figure 3.3) identifies a phase in which a 
finite portion of variables are fixed within dominant clusters. This transition triggers an exponential 
decrease in the total number of solutions, which leads to (e) unsatisfiability transition.® This rough 

^Ist order dynamical RSB. Symmetry Breaking is a general term indicating a phenomenon during which a system is 
breaking the symmetry that governs its behaviour by selecting a particular branch. The term Replica Symmetry Breaking 
(RSB) originates from the technique -i.e., Replica trick ([211])- that was first used to analyze this setting. According to 
RSB, the trivial symmetries of the problem do not characterize the clusters of solution. 

®lst order static RSB. 

®In some problems, the rigidity transition occurs before condensation transition. 
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Figure 3.4-' This schematic view demonstrates the clustering during condensation phase. Here assume x and y 
axes correspond to xi and Xj. Considering the whole space of assignments, xi and xj are highly correlated. The 
formation of this correlation between distant variables on a factor-graph breaks BP. Now assume that Perturbed 
BP messages are focused on the largest shaded ellipse. In this case the correlation is significantly reduced. 


picture summarizes first order Replica Symmetry Breaking’s (IRSB) basic assumptions [208]. 

3.1.1 Pitfalls of decimation 

Previously we gave an argument against decimation, based on the complexity of marginalization and 
integration. Some recent analyses draw similarly negative conclusions on the effect of decimation [71, 
218, 261]. The general picture is that at some point during the decimation process, variables form 
long-range correlations such that fixing one variable may imply an assignment for a portion of variables 
that form a loop, potentially leading to contradictions. Alternatively the same long-range correlations 
result in BP’s lack of convergence and error in marginals that may lead to unsatisfying assignments. 

Perturbed BP avoids the pitfalls of BP-dec in two ways: (I) Since many configurations have non-zero 
probability until the final iteration, perturbed BP can avoid contradictions by adapting to the most 
recent choices. This is in contrast to decimation in which variables are fixed once and are unable to 
change afterwards. Some backtracking schemes [240] attempt to fix this problem with decimation. (II) 
We speculate that simultaneous bias of all messages towards sub-regions, prevents the formation of 
long-range correlations between variables that breaks BP in IsRSB; see figure 3.4. 


3.2 Revisiting survey propagation 

SP is studied on random (hyper) graphs representing CSPs at thermodynamic limit (i.e., as iV —)• oo). 
Large random graphs are locally tree-like, which means the length of short loops are typically in the 
order of log(Ai) [145]. This ensures that, in the absence of long-range correlations, BP is asymptotically 
exact, as the set of messages incoming to each node or factor are almost independent. Although 
BP messages remain uncorrelated until the condensation transition [179], the BP equations do not 
completely characterize the set of solutions after the clustering transition. This inadequacy is indicated 
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by the existence of a set of several valid fixed points (rather than a unique fixed-point) for WP as an 
instance of BP. For a better intuition, consider the cartoons of figure 3.3(middle) and (right). During 
the clustering phase (middle), Xi and Xj (corresponding to the x and y axes) are not highly correlated, 
but they become correlated during and after condensation (right). This correlation between variables 
that are far apart in the factor-graph results in correlation between BP messages. This is because it 
implies that even if loops are long, they remain influential. This violates BP’s assumption that messages 
are uncorrelated, which results in BP’s failure in this regime. 

This is where survey propagation comes into the picture in solving CSPs. Going back to our algebraic 
notation for SP, using counting SP with warning propagation semiring ({0,1}, max, prod) as the initial 
semiring and sum-product (M-*^, sum, prod) as the SP semiring, is computationally tractable. This is 
because y* = {0,1} in the initial semiring is finite, and therefore each message can have finite number 
of 21'^* I values 


p,^i(a;,), Pi^.(.) G {(0,..., 0), (0,..., 0,1), (0,..., 1,0),..., (1,..., 0),..., (1,..., 1)} Vi, I G di 


This means each SP message is a distribution over these possibilities Sj_>i(pj^i) G However 

since (0 ,..., 0) indicates an unfeasible case, where no assignment is allowed, we explicitly ignore it in 
SP message updates. This gives us the following update equations and marginals for SP when applied 
to CSPs 


Si^i(Pi^i) oc ^ l(Pi^i(. 

)= n pj^i(-))( n Sj^i(pj^*)) vf,iGdi 

(3.3) 

—9i\I— 

Je9i\i JeaAi 


Si^i(pi^i) oc ^ l(pi^i(. 

) = J^fi(Ti) Yl Pi^i(-))( n Si^i(Pi^i)) 

(3.4) 

Psi\i-»I 

*\i j&dl\i je9I\i 


S(Pi) = T(Pi(-) = 

n pi^i(-)) n Si^i(pi^i) 

(3.5) 

—di—¥i 

l£di l£di 



Si^j((0 ,...,0)) = 0 and Si^i((0 ,..., 0)) = 0 


Example 3.2.1. Consider the SP message Sj_>i(pj_>i) in factor-graph of figure 3.1. Here the summation 
in equation (3.3) is over all possible combinations of max-product BP messages pj_>.jPK->.j,PL->-i)PH->-i- 
Since each of these messages can assume one of the three valid values - e.g., pj_^j(xi) G {(0,1), (1,0), (1,1)} 
- for each particular assignment of pj^i, a total of 3^ possible combinations are enumerated in the sum¬ 
mations of equation (3.3). However only the combinations that form a valid max-product message 
update have non-zero contribution in calculating Si_^i. 


3.2.1 Flavours of SP-guided decimation 

The SP-marginal over WP marginals (equation (3.5)) also implies a distribution S{xi) over the original 
domain (see equation (2.87)). Similar to BP-dec we can use either the implicit marginals of equa- 
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tion (2.87) or the SP marginals of equation (3.5) to perform decimation. In the former case, which we 
call SP-dec(S) we select x* = arg^,, maxS(xi) during decimation, and in the later case, which we call 
SP-dec(C), we clamp p* = arg^. maxS(pj). This means all the outgoing messages from this variable 
node in the factor-graph are clamped in the same way - i.e., Si_^i(pi^i) = p* VI G di. 

In the first case, SP-dec(S), we expect a single assignment x*, while for SP-dec(C) at the end of 
decimation we should obtain a cluster of solutions, where a subset of assignments is allowed for each 
Xi- However, during the decimation process (in both SP-dec(S) and SP-dec(C) ), usually after fixing a 
subset of variables, SP marginals, S{xi), become close to uniform, indicating that clusters of solution 
have no preference over particular assignment of the remaining variables. The same happens when 
we apply SP to random instances in RS phase (figure 3.3(left)). At this point (a.k.a. paramagnetic 
phase) solutions form a giant cluster and a local search method or BP-dec can often efficiently find an 
assignment to the variables that are not yet fixed by decimation. 

The original decimation procedure for iV-SAT [48] corresponds to SP-dec(S). SP-dec(C) for CSP 
with Boolean variables is only slightly different, as SP-dec(C) can choose to fix a cluster to p* = (1,1) in 
addition to the options of p* = ( 1 , 0 ) and Pi = ( 0 ,1) (corresponding to Xj = 0 and Xj = 1 respectively), 
available to SP-dec(S). However, for larger domains (e.g., iV-COL), SP-dec(C) has a clear advantage. 
For example, in 3-COL, SP-dec(C) may choose to fix a variable to Pi_s.i = (0,1,1) (i.e., the first color is 
not allowed) while SP-dec(S) can only choose between pj G {(0, 0,1), (0,1,0), (1,0,0)}. This significant 
difference is also reflected in their comparative success-rate on iV-COL. ' (See section 3.3) 

3.2.2 Computational Complexity 

The computational complexity of each SP update of equation (3.4) is C(2l'^*l — l)l^^l as for each particular 
value Pi-^-i, SP needs to consider every combination of incoming messages, each of which can take 
values (minus the empty set). Similarly, using a naive approach the cost of update of equation (3.3) 
is 0(2l'^'l — 1)I^*L However by considering incoming messages one at a time, we can perform the same 
exact update in 0{\di\ 2^1'^“!). In comparison to the cost of BP updates {i.e., 0{\di\ \Xi\) and C>(|Ai|) for 
two types of message update; see section 2 . 1 ), we see that SP updates are substantially more expensive 
for large domains \Xi\ and higher order factors with large |9I|. 

3.2.3 Perturbed survey propagation for CSP 

Similar to SP, we use perturbed SP with ({0,1}, max, prod) as the first semiring and (M, sum, prod) as 
the second semiring. Since perturbed SP seeks a single assignment, rather than a cluster of solutions, 
it can find satisfying solutions to paramagnetic instances. This is in contrast to SP-dec, which in 
paramagnetic cases returns a trivial WP fixed point in which all assignment are allowed. This means as 
opposed to SP-dec, which is mostly applied to random CSPs in the clustering and condensation phase, 
^Previous applications of SP-dec to A'-COL by [49] used a heuristic for decimation that is similar SP-dec (C). 
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perturbed SP can be used to solve non-random and also random instances in RS phase. 

To demonstrate this, we applied perturbed SP to benchmark CSP instances of figure 3.2, in which 
the maximum number of elements in the factor was less than 10.® Here perturbed SP solved 80 instances 
out of 202 cases, in comparison to perturbed BP that solved 78 instances, making perturbed SP slightly 
better, also in solving real-world problems. 


3.3 Satisfiability and coloring 


In examples 3.1.1 and 3.1.2, we introduced the random procedures that are often used to produce 
instances of iC-satisfiability and iC-coloring problems. 

Here we report the results on RT-SAT for K G {3,4} and iC-COL for K G {3,4,9}. We used the 
procedures to produce 100 random instances with N = 5, 000 variables for each control parameter a and 
here report the probability of finding a satisfying assignment for different methods - i.e., the portion of 
100 instances that were satisfied by each method.^ 

Figure 3.5(first row) visualizes the success rate of different methods on 3-SAT (right) and 3-COL 
(left), figure 3.5(second row) reports the number of variables that are fixed by SP-dec(C) and (S) 
before calling BP-dec as local search. The third row shows the average amount of time that is used 
to find a satisfying solution. This does not include the failed attempts. For SP-dec variations, this 
time includes the time used by local search. The final row of figure 3.5 shows the number of iterations 
used by each method at each level of difficulty over the successful instances. Note that this does not 
include the iterations of local search for SP-dec variations. Here the area of each disk is proportional 
to the frequency of satisfied instances with particular number of iterations for each control parameter 
and inference method^'^. 

Here we make the following observations: 

(I) Perturbed BP is much more effective than BP-dec, while remaining ten to hundreds of time 
more efficient. (H) As the control parameter grows larger, the chance of requiring more iterations to 
satisfy the instance increases for all methods. (HI) Although computationally very inefficient, BP-dec 
is able to find solutions for instances with larger control parameter than suggested by previous results 
(e.^., [208]). (IV) For many instances where SP-dec(C) and (S) use few iterations, the variables are 

®The number of iterations and other settings for perturbed SP were identical to the ones used to compare BP-dec and 
perturbed BP. 

®For coloring instances, to help decimation, we break the initial symmetry of the problem by fixing a single variable 
to an arbitrary value. For BP-dec and SP-dec, we use a convergence threshold of e = .001 and fix p = 1% of variables 
per iteration of decimation. Perturbed BP and Perturbed SP use T = 1000 iterations. Decimation-based methods use a 
maximum of T = 1000 iterations per iteration of decimation. If any of the methods failed to find a solution in the first 
attempt, T was increased by a factor of 4 at most 3 times - ie., in the final attempt T = 64,000. To avoid blow-up in 
run-time, for BP-dec and SP-dec, only the maximum iteration, T, during the first iteration of decimation, was increased 
(this is similar to the setting of [48] for SP-dec). For both variations of SP-dec (see section 3.2.1) after each decimation 
step, if maxima;. p(a:i) — < .01 (i.e., marginals are close to uniform) we consider the instance para-magnetic, and run 

BP-dec (with T — 1000, e = .001 and p = 1%) on the simplified instance. 

'^’^The number of iterations are rounded to the closest power of two. 
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Figure 3.5: (first row) Success-rate of different methods for 3-COL and 3-SAT for various control parameters, 
(second row) The average number of variables (out of N = 5000J that are fixed using SP-dec (C) and (S) before 
calling local search, averaged over 100 instances, (third row) The average amount of time (in seconds) used 
by the successful setting of each method to find a satisfying solution. For SP-dec(C) and (S) this includes the 
time used by local search, (forth row) The number of iterations used by different methods at different control 
parameters, when the method was successful at finding a solution. The number of iterations for each of 100 
random instances is rounded to the closest power of 2. This does not include the iterations used by local search 
after SP-dec. 
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fixed to a trivial cluster p* = (1,1,...,!), in which all assignments are allowed. This is particularly 
pronounced for 3-COL. For instances in which non-trivial fixes are zero, the success rate is solely due to 
local search (i.e., BP-dec). (V) While SP-dec(C) and SP-dec(S) have a similar performance for 3-SAT, 
SP-dec(C) significantly outperforms SP-dec(S) for 3-COL. 

Table 3.1 reports the success-rate as well as the average of total iterations in the successful attempts 
of each method, where the number of iterations for SP-dec(C) and (S) is the sum of iterations used 
by the method plus the iterations of the following BP-dec. Here we observe that perturbed BP can 
solve most of easier instances using only T = 1000 iterations {e.g., see perturb BP’s result for 3-SAT at 
a = 4., 3-COL at a = 4.2 and 9-COL at a = 33.4). The results also show that most difficult instances 
(that require more time/iterations) for each method approximately correspond to the control parameter 
for which half of the instances are satisfied. Larger control parameters usually result in early failure in 
satisfiability. 

Table 3.1 suggests that, as we speculated in section 3.2, SP-dec(C) is in general preferable to SP- 
dec(S), in particular when applied to the coloring problem. The most important advantage of Perturbed 
BP over SP-dec and Perturbed SP is that it can be applied to instances with large factor cardinality 
(e.g., 10-SAT) and variable domains (e.g., 9-COL). For example for 9-COL, the cardinality of each SP 
message is 2® = 512, which makes SP-dec and Perturbed SP impractical. Here BP-dec is not even able 
to solve a single instance around the dynamical transition (as low as a = 33.4) while perturbed BP 
satisfies all instances up to a = 34.1.^^ 

3.4 Clique-cover problem 

The K-clique-cover C = {Ci,... ,Ck} for a graph Q = (V,T) is a partitioning of V to at most K 
cliques - i.e., i,j G => (bj) £ 

NP-completeness of K-clique-cover can be proved by reduction from K-coloring [159]: A K-clique- 
cover for Q', the complement of G (i.e., Q' = (V,T' = {(i,j) \ (i,j) ^ '?}))) is a K-coloring for Q, where 
all the nodes in the same clique of Q' are allowed to have the same color in Q. 

The relation between K-clique-cover and K-coloring extends to their factor-graphs. While in K- 
coloring, factors j(xj, Xj) = l(xi ^ xj) V(i,j) G £ ensure that the connected nodes have different 
colors, for k-clique-cover factors f|jj}(xj, Xj) = l(xj ^ Xj) y(i,j) ^ £ ensure that nodes that are not 
connected can not belong to the same clique. Here Xj G {1,... ,K} represents the clique of node i. 

The factors, f^ijj(xi,Xj) = l(xj / Xj), in both K-clique-cover and K-coloirng are inverse Potts 
factors that allow efficient 0(K) calculation (see section 2.2.1). Using £(i,-) = {(i,j) G £} to denote 
the set of edges adjacent to node i, the following claim states the complexity of BP updates. 

Note that for 9-COL condensation transition happens after rigidity transition. So if we were able to find solutions 
after rigidity, it would have implied that condensation transition marks the onset of difficulty. However, this did not occur 
and similar to all other cases, Perturbed BP failed before rigidity transition. 
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Table 3.1: Comparison of different methods on {3, A}-SAT and {3,4, 9}-COL. For each method the success-rate 
and the average number of iterations (including local search) on successful attempts are reported. The approximate 
location of phase transitions are from [219, 330] . 


Problem 

Ctrl param a 

BP-dec 

SP-dec{C) 

SP-dec(S) 

Perturbed BP 

Perturbed SP 

avg. iters. 

success rate 

avg. iters. 

success rate 

avg. iters. 

success rate 

avg. iters. 

success rate 

avg. iters. 

success rate 

3.86 

dynamical and condensation transition 

4.1 

85405 99% 

102800 100% 

96475 100% 

1301 100% 

1211 100% 

4.15 

104147 83% 

118852 100% 

111754 96% 

5643 95% 

1121 100% 

4.2 

93904 28% 

118288 65% 

113910 64% 

19227 53% 

3415 87% 

3 S \T 

100609 12% 

112910 33% 

114303 36% 

22430 28% 

8413 69% 

4.23 

123318 5% 

109659 36% 

107783 36% 

18438 16% 

9173 58% 

4.24 

165710 1% 

126794 23% 

118284 19% 

29715 7% 

10147 41% 

4.25 

N/A 0% 

123703 9% 

110584 8% 

64001 1% 

14501 18% 

4.26 

37396 1% 

83231 6% 

106363 5% 

32001 3% 

22274 11% 

4.268 

satisfiability transition 

9.38 

dynamical transition 

9.547 

condensation transition 

9.73 

134368 8% 

119483 32% 

120353 35% 

25001 43% 

11142 86% 

4-SAT 9,75 

168633 5% 

115506 15% 

96391 21% 

36668 27% 

9783 68% 

9.78 

N/A 0% 

83720 9% 

139412 7% 

34001 12% 

11876 37% 

9.88 

rigidity transition 

9.931 

satisfiability transition 

4 

dynamical and condensation transition 

4.2 

24148 93% 

25066 94% 

24634 94% 

1511 100% 

1151 100% 

4.4 

51590 95% 

52684 89% 

54587 93% 

1691 100% 

1421 100% 

4.52 

61109 20% 

68189 63% 

54736 1% 

7705 98% 

2134 98% 

3 COL 

N/A 0% 

63980 32% 

13317 1% 

28047 65% 

3607 99% 

4.6 

N/A 0% 

74550 2% 

N/A 0% 

16001 1% 

18075 81% 

4.63 

N/A 0% 

N/A 0% 

N/A 0% 

48001 3% 

29270 26% 

4.66 

rigidity transition 

4.66 

N/A 0% 1 N/A 0% 1 N/A 0% | N/A 0% | 40001 2% 

4.687 

satisfiability transition 

8.353 

dynamical transition 

8.4 

64207 92% 1 72359 88% | 71214 93% | 1931 100% | 1331 100% 

ICOL 

dynamical transition 

8.55 

77618 13% 

60802 13% 

62876 9% 

3041 100% 

5577 100% 

8.7 

N/A 0% 

N/A 0% 

N/A 0% 

50287 14% 

N/A 0% 

8.83 

rigidity transition 

8.901 

satisfiability transition 

33.45 

dynamical transition 

33.4 

N/A 0% 

N/A N/A 

N/A N/A 

1061 100% 

N/A N/A 

33.9 

N/A 0% 

N/A N/A 

N/A N/A 

3701 100% 

N/A N/A 

34.1 

N/A 0% 

N/A N/A 

N/A N/A 

12243 100% 

N/A N/A 

9-COL 34.5 

N/A 0% 

N/A N/A 

N/A N/A 

48001 6% 

N/A N/A 

35.0 

N/A 0% 

N/A N/A 

N/A N/A 

N/A 0% 

N/A N/A 

39.87 

rigidity transition 

43.08 

condensation transition 

43.37 

satisfiability transition ^ 
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Claim 3.4.1. Each iteration of BP with variable-synchronous message update for K-clique-cover factor- 
graph is 0{K{N‘^ — |<?|)), while asynchronous message update is 0{KYfi^xi{N — \S{i, •)l)^)- 

Proof. Here the complexity of calculating factor-to-variable message is 0{K). Since there 

are — £ factors (one for each edge in Q') the total cost of factor-to-variable messages becomes 
0{K{N^-\£\)). 

The time complexity of each variable-to-factor message (pj^pj}) is 0(iA|Ai|), where Ai, Markov 
blanket of i in the factor-graph, is the set of nodes in V that are not adjacent to z in ^ - i.e., |Az| = 
N — £{i, ■). Using variable-synchronous update the total cost of variable-to-factor messages becomes 
N — |T(i,-)|) = 0{K{N'^ — 2|T|)). This means the cost of all messages in BP update is in 
the order of 0{K{N‘^ — |T|)). 

However, using asynchronous update, at each node i, we have to calculate N — \£{i, •)! messages. 
Since each of them is 0{N — |T(i,-)|), the total cost of variable-to-factor dominates the cost of BP 
update which is 0{K ~ l‘^(b Ol)^)- 

Our experimental results for K-clique-cover are within the context of a binary-search scheme, as the 
sum-prodnct reduction of the min-max formnlation of K-clustering. 


3.5 Dominating set and set cover 

The if-dominating set of graph Q = (V, £) is a subset of nodes P C V of size \T)\ = K such that any 
node in V \ P is adjacent to at least one member oiV ~ i.e., Vz G V \ P 3j £ V s.t. {i,j) G £. The 
dominating set problem is NP-complete [107] and has simple redactions to and from set cover problem 
[158]. As we see, the factor-graph formulations of these problems are also closely related. 




Figure 3.6: (left) an induced 2-set-cover problem and the solution "D = {i,k}. (right) The factor-graph 
representation of the same problem, where leader factors are grey squares, consistency factors are in black and 
the K-of-N factor is in white. 
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Given universe set V and a set of its subsets S = {Vi,..., Vm} s.t. Vm C V, we say CCS covers 
V iff each member of V is present in at least one member of C -i.e., IJv^eC ~ consider 

a natural set-cover problem induced by any directed graph. Given a directed-graph Q = (V,<S), 
for each node i € V, define a subset Vj = {j G V | (j, i) G f} as the set of all nodes that are 

connected to i. Let S = {Vi,..., Vv} denote all such subsets. An induced iL-set-cover of ^ is a set 
C C 5 of size K that covers V. Equivalently, The induced iL-set-cover of ^ = (V,T) is a subset of 
vertices T> CV^ with \T)\ = iL, such that every node not in T) is connected to at least one node in T). 
For example in figure 3.6(left), S = {{n, m, i}, {z,j}, {j, A:,/},{/, m, n}, {n}} and its induced solution 
C = {{n, m, i}, {j, fc, Z}} is indicated by grey nodes V = {i, k}. 

If we consider an undirected graph ^ as a directed graph with edges in both directions, then K- 
dominating set of G is equivalent to an induced K-set-cover problem on Q. Moreover given any K-set- 
cover problem instance S = {Vi,..., Vm}, we can construct a directed graph G such that the “induced” 
K-set-cover on G is equivalent to the given K-set-cover problem. For this, let V = (Uvme 5 
{wi,..., um} be the collection of nodes in S plus one node Um per each subset Vm G S. Now define the 
directed edges in S to connect every i G Vm to its representative Um ■ Moreover connect all representatives 
to each other in both directions - i.e., £ = {{i,Um) \ Vm,i G Vm} U {{um,Um') \ Vm 7 ^ m'}. It is easy 
to show that the induced K-set-cover on this directed graph defines a set-cover for S. 

3.5.1 Factor-graph and complexity 

For both problems we have one variable per edge Xi:j G {0,1} V(i, j) G £. Note that the G for induced 
K-set-cover problem is a directed graph, while the G for the K-dominating-set is undirected. This is 
the only difference that affects the factor-graph representation of these two problems. Here, Xi,j = 1 
indicates that node j € T> and node i is associated with node j. Three types of constraint factors ensure 
the assignments to Xi-,j define a valid solution to K-dominating-set and induced K-set-cover: 

• Leader factors ensure that each node i is associated with at least one node j (where j = i is 
admissible). Let £~^{i, •) = {{i,j) G U {(i, i)} be the set of edges leaving node i plus (i, i). Then 

f£+G-)fe+(i,-)) = 1(( Xi..j)>l) ViGV (3.6) 

is the leader factor associated with node i. 

• Consistency factors ensure that if node j is selected as the leader by node i, node j also selects 
itself as leader: 


= 0 V Xj-.j = 0) V(i, j) G £ 


(3.7) 


An alternative form of this factor is a high-order factor that allows efficient 0{\£{-,i)\) factor-to- 
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variable update 


= 1V Y1 ^r-i = 0) (3.8) 

• At most K-of-N factor ensures that at most K nodes are selected as leaders {\T)\ < K): 

^ ^ Xj-.i ^ K) (3-9) 

iev 

Figure 3.6 shows an example of induced K-set-cover problem and its corresponding factor-graph. 
In section 2.2.1 we saw that it is possible to calculate sum-product factor-to-variable BP messages for 
leader factors in 0{\£{i, •)!) and each at-most-K-of-N factor in 0{KN). This cost for consistency factors 
is 0(1) for the pairwise and 0(\£{-,i)\) for the alternative formulation. 

Claim 3.5.1. The time-complexity of message passing for the factor-graph above depending on the 
update schedule is 0(\£\ -\-KN) for factor-synchronous (1-sync; see section 2.1.1) update and 0{KN‘^-\- 
Xliev l‘^(b OP + l^(■)*)P) fox asynchronous update. 

Proof. We assume the consistency factors are in the higher order form of equation (3.8). Here, each 
variable Xi.,j is connected to at most three factors and therefore the cost of variable-to-factor messages is 
0(1). If we calculate factor-to-variable messages simultaneously, the cost is 0(X]iev Ol) leader 
and 0(EieV |T(-,i)|), giving a total of 0(\£\). Adding this to the cost of K-of-N factor the total cost 
per iteration of BP is 0{\£\ -\- KN). 

On the other hand, if we update each factor-to-variable separately, the previous costs are multiplied 
by |9I|, which gives 0{KN'^ + ^^^^\£{i,-)\‘^ + \£(-,i)\‘^). □ 

3.6 Clique problem, independent set and sphere packing 

Given graph Q, the K-clique problem asks whether Q contains a clique of size at least K. The K- 
clique problem is closely related to K-independent-set. Given graph Q = (V,£), the K-independent 
set problem asks whether V contains a subset of size at least K, s.t. there is no connection between 
nodes in 2? - i.e., yi,j € P {i,j) ^ The relation between K-clique problem and K-independent-set 
is analogous to the connection between K-coloring and K-clique-cover problems: the K-clique problem 
on 0 is equivalent to K-independent-set problem on its complement G'. 

K-independent-set is in turn equivalent to (N-K)-vertex cover problem. A vertex cover 2?' C V is 
a subset of nodes such that each edge is adjacent to at least one vertex in V. It is easy to see that V 
is an independent set iff 2?' = V \ 2? is a vertex cover. Therefore our solution here for independent set 
directly extends to vertex cover. 
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K-packing is a special case of sub-graph isomorphism, which asks whether a graph Qi is a sub-graph 
of Q 2 - For packing, Q 2 is the main graph {Q) and Qi is the complete graph of size K (see section 5.3) K- 
independent-set and K-clique problems are also closely related to sphere packing and finding nonlinear 
codes. To better motivate these problems, here we start with the problem of K-packing formulated as 
min-max problem in (several) factor-graphs. We then show that the sum-product reduction of K-packing 
is K-independent-set and K-clique problems. By doing this we simultaneously introduce message passing 
solutions to all three problems. 

Given a symmetric distance matrix A G between N data-points and a number of code¬ 

words K, the K-packing problem is to choose a subset of K points such that the minimum distance 
Ajj- between any two code-words is maximized. Here we introduce two different factor-graphs such that 
min-max inference obtains the K-packing solution. 

In order to establish the relation between the min-max problem and the CSPs above, we need the 
notion of ^-neighbourhood graph 

Definition 3.6.1. The y-neighborhood graph for (distance) matrix A G is defined as the 

graph Q{A,y) = (V,T), where V = {1,... ,iV} and £ = {{i,j) \ Aij < y}. 

3.6.1 Binary variable factor-graph 

Let binary variables x = {xi,..., xat} G {0,1}'^ indicate a subset of variables of size K that are selected 
as code-words. We define the factors such that the min-max assignment 

X* = arg3,min mg^ fl(lii) 

is the K-packing solution. Define the factor-graph with the following two types of factors: 

• K-of-N factor (section 2.2.1) 


W(*) = m'^Xi) = K) 
i&N 

ensures that K code-words are selected. Recall that definition of 1(.) depends on the semiring (equa¬ 
tion (1.20)). The K-packing problem is defined on the min-max semiring. However, since we plan to 
solve the min-max inference by sum-product reductions, it is important to note that the Py reduction 
of 1(.) for any y is 1(.) as defined for sum-product ring. 

• Pairwise factors are only effective if both Xi and Xj are non-zero 

f{jj}(xi, Xj) = min (l(xi = 0 V Xj = 0), max (l(xj = 1 A Xj = 1), —Ajj)) Vi,j (3.10) 


where the tabular form is simply 
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0 

1 

0 

— OO 

— OO 

1 

— (X) 

— A • • 


Here the use of — Aij is to convert the initial max-min objective to min-max/^ Recall that Q{A,y) 
defines a graph based on the distance matrix A, s.t. two nodes are connected iff their distance is “not 
larger than” y. This means Q{—A, —y) corresponds to a graph in which the connected nodes have a 
distance of “at least” y. 

Proposition 3.6.1. The Py-reduction of the K-packing faetor-graph above for the distance matrix 
A G defines a uniform distribution over the cliques of Q{—A, —y) of size K. 

Proof. Since Py(x) is uniform over its domain it is enough to show that: 

• Every clique of size K in Q{—A, —y) corresponds to a unique assignment x* with Py{x') > 0; 

Given a clique C C V of size K in Q{—A, —y) = (V, {(f, j) | Ajj > y}), define x* = {xi = ident{i G C) | 
i G V}. It is easy to show that Py{x*) > 0. For this we need to show that all the constraint factors in 
Py are satisfied. The K-of-N factor is trivially satisfied as \C\ = K. The p^-reduction of the pairwise 
factor of equation (3.10) becomes 

^{i,j}ixi,Xj) = l{xi = 0 V Xj = 0) + [l{xi = l^xj = l)l{-Aij < -y)) \/i,j (3.11) 

where we replaced min and max with -|- and . operators of the sum-product semiring and thresholded 
—Ajj by —y. To see that all pairwise constraint factors are satisfied consider two cases: (I) nodes 
i,j G C, and therefore x* = x* = 1. This also means i and j are connected and definition of Q{—A, —y), 
this implies Ajj > y. Therefore the second term in factor above evaluates to one. (II) either i or j are 
not in C, therefore the first term evaluates to one. Since both pairwise factors and the K-of-N factors 
for X* are non-zero Py{x*) > 0. 

• every assignment x* with Py{gf ) > 0 corresponds to a unique clique of size K in Q{—A, —y): 

equation (3.11) implies Xi = 1 A Xj = 1 Ajj > y. On the other hand, Py{x} > 0 means K-of-N 

factor is satisfied and therefore exactly K variables Xj are nonzero. Therefore the index of these variables 
identifies subset of nodes in Q(—A, —y) that are connected (because Ajj > y), forming a clique. □ 

Claim 3.6.2. The time-complexity of each iteration of sum-product BP for the factor-graph above is 

- 0{NK -\- |T|) for variable and factor synchronous update. 

- 0{N‘^K) for factor-synchronous (Psync) message update. 

- 0{N^) for completely asynchronous update. 

'^^The original objective is max-min because it aims to maximize the minimum distance between any two code-words. 
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Figure 3.7: Using message passing to choose AT = 30 out of N = 100 random points in the Euclidean plane to 
maximize the minimum pairwise distance (with T = 500 iterations for PBP). Touching circles show the minimum 
distance 

Proof. The cost of f-sync calculation of factor-to-variable messages for the K-of-N factor is 0{NK) 
and the cost of factor-to-variable messages for pairwise factors is 0(1). Since there are \£\ such factor, 
this cost evaluates to 0{NK + |T|). Since each node is adjacent to \£{i, •)! other nodes, the variable- 
synchronous update of variable-to-factor messages is 0(X]ieV l^(b ■)!) ~ which gives a total 

time-complexity of 0{NK + |T|). 

In asynchronous update, the cost of factor-to-variable messages for the K-of-N factor is 0{N‘^K) 
as we need to calculate each message separately. Moreover, updating each variable-to-factor message 
is C>(|T(i, •)), resulting a total of l‘^(b OP) = 0{N^). Since K < N, when all the updates are 

asynchronous, this cost subsumes the factor-to-variable cost of 0{N'^K). □ 

A corollary is that the complexity of finding the approximate min-max solution by sum-product 
reduction is 0{{NK -|- ITI) log(A^)) using synchronous update. Figure 3.7 shows an example of the 
solution found by message passing for K-packing with Euclidean distance. 

3.6.2 Categorical variable factor-graph 

Define the K-packing factor-graph as follows:Let x = {xi,...,xx} be the set of K variables where 
Xi G {1,...,A'}. For every two distinct points 1 < i < j < K define the factor 

i{ij}{xi,Xj) = max ( - A^^^^.,l{xi ^ xj)) (3.12) 

Here each variable represents a code-word and this factor ensures that code-words are distinct. Moreover 
if Xi and xj are distinct, i^ijj{xi,Xj) = —Ax^^xj is the distance between the nodes that Xi and Xj 
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represent. 

The tabular form of this factor is 
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The following proposition relates this factor-graph to K-clique problem. 

Proposition 3.6.3. The Py-reduction of the K-packing faetor-graph above for the distance matrix 
A G defines a uniform distribution over the cliques ofQ{—A, —y) of size K. 

Proof. Since Py defines a uniform distribution over its support, it is enough to show that any clique of size 
K over 0{—A, —y) dehnes a unique set of assignments all of which have nonzero probability (Pj/(x) > 0) 
and any assignment x with Py{g^ > 0 dehnes a unique clique of size at least K on Q{—A, —y). First 
note that the basic difference between Q{A,y) and Q{—A, —y) is that in the former all nodes that are 
connected have a distance of at most y while in the later, all nodes that have a distance of at least y 
are connected to each other. Consider the Pj^-reduction of the pairwise factor of equation (3.12) 

= 1 (max(-A^.,^^-, l(xi ^ Xj)) < y) (3.13) 

— 1 ( ^ y ^ ^ 

where, basically, we have replaced max from min-max semiring with (8) operator of the sum-product 
semiring and thresholded Ax^^xj- 

• Any clique of size K in Q(—A, —y), defines K unique assignments, such that for any such assignment 
fi*, Py{x*) > 0 .' 

For a clique C = {ci,..., ck} T V of size K, dehne x* = Cx-^ip where vr : {1,..., K} —)• {1,... ,K} 
is a permutation of nodes in clique C. Since there are K such permutations we may define as many 
assignments x*. Now consider one such assignment. For every two nodes x* and xt, since they belong 
to the clique C in 0(—A, —y), they are connected and Ax^^xj > U- This means that all the pairwise 
factors defined by equation (3.13) have non-zero values and therefore Py{x) > 0. 

• Any assignment x* with Py{x*) > 0 eorresponds to a unique clique of size K in Q{—A,—y): Let 

C = {x\,..., x*^}. Since Py{xf) > 0, all pairwise factors defined by equation (3.13) are non-zero. 
Therefore Vi, j 7 ^ i Ax xj > U, which means all Xi and Xj are connected in 0{—A, —y), forming a clique 
of size K. □ 
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To put simply, in acquiring the Pj^-reduction, we set the values in the table form above to zero if their 
value is less than y and set them to one otherwise. The resulting factor-graph, defines a distribution 
Pj/(x), s.t. Py{x) > 0 means x defines a clique of size iC in a graph G{—A, —y) which connects nodes 
with distance larger than y. 

Claim 3.6.4. Each iteration of BP for this factor-graph with pairwise factors is 0{N‘^K‘^), for syn¬ 
chronized update and 0{N‘^K^) for asynchronous update. Using the sum-product reduction of min-max 
inference this suggests a 0{N‘^K'^ log{N)) (for sync, update) and 0{N‘^\og{N)) (for asynchronous 
update) procedure for K-packing problem. 

Proof. (Claim 3.6.4) Since the factors are not sparse, the complexity of calculating a single factor-to- 
variable message is 0(|Ti|) = 0{N‘^), resulting in 0{N‘^K‘^) cost per iteration of variable-synchronous 
update for BP. However if we update each message separately, since each message update costs 0{N‘^K), 
the total cost of BP per iteration is 0{N'^K^) 

Since the diversity of pairwise distances is that of elements in A - i.e., |T| = 0{N‘^) - the general cost 
of finding an approximate min-max solution by message passing is 0{N‘^K‘^ log{N)) for sync, message 
update and 0{N^K^ \og{N)) for async. update. □ 

The Py-reduction of our second formulation was first proposed by [252] to find non-linear binary 
codes. The authors consider the Hamming distance between all binary vectors of length n {i.e., N = 2”’) 
to obtain binary codes with known minimum distance y. As we saw, this method is 0{N‘^K‘^) = 
0(2^”A^) -i.e., grows exponentially in the number of bits n. In the following section, we introduce a 
factor-graph formulation specific to categorical variables with Hamming distance whose message passing 
complexity is polynomial in re = log(A). Using this formulation we are able find optimal binary and 
ternary codes where both re and y are large. 

3.6.3 Efficient Sphere packing with Hamming distance 

Our factor-graph defines a distribution over the K binary vectors of length re such that the distance 
between every pair of binary vectors is at least y. 

To better relate this to K-clique problem, consider 2” binary code-words of length re as nodes of a 
graph and connect two nodes iff their Hamming distance is at least y. Finding a K-clique in this graph is 
equivalent to discovery of so-called nonlinear binary codes - a fundamental problem in information 
theory {e.g., see [75, 193]). Here assuming y is an odd number, if at most digits of a code-word 
are corrupted {e.g., in communication), since every pair of codes are at least y digits apart, we can 
still recover the uncorrupted code-word. The following is a collection of A = 12 ternary code-words 
of length re = 16, obtained using the factor-graph that we discuss in this section, where every pair of 
code-words are different in at least y = 11 digits. 

For convenience we restrict this construction to the case of binary vectors. A similar procedure may be used to find 
maximally distanced ternary and g-ary vectors, for arbitrary q. 
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Figure 3.8: The factor-graph for sphere-packing with Hamming distances, where the distance factors are white 
squares and z-factors are in black. 
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The construction of this factor-graph is more involved than our previous constructions. The basic 
idea to avoid the exponential blow up is to have one variable per digit of each code-word (rather than 
one variable per code-word). Then for each pair of code-words we define an auxiliary binary vector of 
the same length, that indicates if the two code-words are different in each digit. Finally we dehne an 
at-least-y-of-n constraint over each set of auxiliary vectors that ensures every two pair of code-words 
are at least different in y digits. Figure 3.8 shows this factor-graph. 

More specifically, let x = {xi-., ■ ■ ■, Xpc-.-} be a set of binary vectors, where x^.. = {xpi,... ,Xr,n} 
represents the binary vector or code-word. Additionally for each two code-words 1 < i < j < K, 
dehne an auxiliary binary vector = {zpj-i, ..., Zi-j-n} of length n. 

For each distinct pair of binary vectors x^.. and Xj.., and a particular digit 1 < k < n, the auxiliary 
variable is constrained to be Zi-j.^ = 1 iff xpk 7^ Xj-k- Then we dehne an at-least-y-of-n factor over z^.^.. 
for every pair of code-words, to ensure that they differ in at least y digits. 
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The factors are defined as follows 
• Z-factors: For every 1 < i < j < K and 1 < A: < n, define 

^{v.k,j\k,i-.j-.k}^^v.ki Xj-,ki Zi-j k) — 1 ^ Xj-k) A Zi-j-k — {{Xi-k — Xj-k) A Zi-j-k — O)) 


This factor depends on three binary variables, therefore we can explicitly define its tabular form con¬ 
taining 2^ = 8 possible inputs. Here the only difference between binary and ternary (and q-aiy) codes 
in general is in the tabular form of this factor. For example for ternary codes, the tabular for of this 
factor is a 3 X 3 X 2 array {zi-j-k is always binary). 

• Distance-factors: For each define at-least-y-of-n factor (section 2.2.1): 

= 1 ( ^ V) 

l<k<n 

Claim 3.6.5. Each iteration of sum-product BP over this factor-graph is 

- 0{K‘^ny) for variable and factor synchronous update. 

- 0{K‘^n^y) for variable-sync update. 

- 0{K^n + K‘^n^y) for completely asynchronous BP update. 

Proof. (Claim 3.6.5) We first consider the complexity of variable-to-factor updates: Each auxiliary 
variable Zi-^.^ is connected to three factors and each Xi^j is connected to 0{K) z-factors. Since there are 
0{nK), Xi:j variables a variable-sync, update of variable-to-factor messages is 0{nK‘^), while async. 
update is 0{nK^). 

Next we consider two possibilities of factor-to-variable updates: We have 0{nK‘^) z-factors, and 
factor-to-variable update for each of them is 0(1), this cost is subsumed by the minimum variable-to- 
factor cost. The factor-graph also has 0{K‘^) distance factors, where the cost of each factor-to-variable 
update is 0{ny). Since |5I| = iF for a distance factor, a factor-sync, update is 0{K‘^ny) in total, while 
an async. update is 0{K‘^n^y). 

Adding the cost of variable-to-factor and factor-to-variable in different scenarios we get the time- 
complexities stated in the claim. □ 

The following table reports some optimal binary codes (including codes with large number of bits 
n) from [193], recovered using this factor-graph. We used Perturbed BP with variable-sync update and 
T = 1000 iterations to find an assignment with p(x*,z*) > 0. 
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3.7 Optimization variations of CSPs 

This section briefly reviews the optimization variations of the CSPs, we have studied so far. The 
optimization version of satisfiability is known as max-SAT or maximum weighted SAT, where each 
clause has a weight, and the objective is to maximize the weighted sum of satisfied clauses. Here, 
simply using factors fi(xj) : Tj —>■ {0, —rci}, where wi is the positive weight of clause I, min-sum BP 
will attempt to find the max-SAT solution. Note that here fj is not a “constraint” factor anymore (see 
definition 1.3.2). Alternative approaches using variations of energetic survey propagation has also been 
used to improve max-SAT results [63, 64]. A less studied optimization variation of satisfiability is 
adversarial SAT which corresponds to min-max-sum inference [54]. 

For minimum coloring - a.k.a. chromatic number ~ and minimum clique-cover problem, since 
the optimal value 1 < K* < Km»^ is bounded, by access to an oracle for the decision version (or an 
incomplete solver [160] such as message passing), we can use binary search to find the minimum K in 
0 (t log(Armax)) time, where the decision problem has a 0{t) time-complexity. In particular, since the 
chromatic number is bounded by the maximum degree [51], approximating the chromatic number using 
binary search and message passing gives a 0(log(maxj(|T(f, •)!)) maxj(|T(i, •)!) |T|) time procedure. 

The same binary-search approach can be used for minimum dominating-set, minimum set-cover, 
maximum clique, maximum independent set and minimum vertex cover. However, these optimization 
variations also allow a more efficient and direct approach. Note that both min-max variation and 
the minimization (or maximization) variations of these problems use binary search. For example both 
minimum clique-cover and K-clustering (see section 4.3) can be solved using binary-search over K- 
clique-cover decision problem. The parameter of interest for binary search is K in the former case, and 
y in the later case, where y is a threshold distance that defines connectivity in Q (see definition 3.6.1). 
However, often both variations (f.e., min-max and minimization or maximization over K) also allow 
direct message passing solutions. 

For uiiuimum domiuatiug-set aud set-cover, we replace the sum-product semiring with min- 
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sum semiring and drop the K-of-N factor. Instead, a local factor = —Xi-AWi gives a weight to each 

node i G V. Here, the min-sum inference seeks a subset of nodes that form a dominating set and have 
the largest sum of weights Yliev Also note that by changing the semiring to min-sum, the identity 
functions 1(.) change accordingly so that the leader factor and consistency constraints remain valid. 
This gives an efficient 0{\£\) synchronous procedure for minimum set-cover and minimum dominating- 
set. This problem and the resulting message passing solution are indeed a variation of K-medians and 
affinity propagation respectively (see section 4.1). 

The same idea applies to maximum clique aud maximum iudepeudeut set: as an alternative 
to fixing or maximizing the “size” of a clique or an independent set, we may associate each node with a 
weight tCj Vi G V and seek a subset of nodes that form an independent set (or a clique) with maximum 
weight. Sanghavi et al. [274] study the max-product message passing solution to this problem and 
its relation to its LP-relaxation. In particular they show that starting from uniform messages, if BP 
converges, it finds the solution to LP relaxation. 

Here we review their factor-graph for maximum independent set using min-sum inference. Let 
X = {xi,..., xj\f} G {0, l}'^ be a set of binary variables, one for each node in V, where x* = 1 means 
i £ V, the independent set. 

• Local factors capture the cost (negative weight) for each node and is equal to —Wi if Xj = 1 and 
zero otherwise 


fi(xj) = min(-u;i, l(xi = 0)) Vi G V 
• Pairwise factors ensure that if (i,j) G T, then either Xj = 0 or Xj = 0 

= 0 V Xj = 0) V(i, j) G £ 

It is easy to see that using a variable synchronous update, message passing can be performed very 
efficiently in 0{\£\). 

Weigt and Zhou [309] (also see [333]) propose an interesting approach to miuimum vertex cover 
using energetic survey propagation. Here, again x = {xi,.. . ,XAr} G {0, 1}'^ has one binary variable 
per node i G V and a pairwise factor ensure that all edges are covered by at least one node in the cover 

f{ij}(®{ij}) = ^Xi = 1V Xj = 1) V(i, j) G £ 

Using the xor-and semiring, the resulting fixed points of warning propagation reveal minimal (but not 
necessarily optimal) vertex covers. The authors then suggest using survey propagation, and decimation 
to find the warning propagation fixed points with lowest energy [i.e., smallest size of cover). 



Chapter 4 

Clustering 


Clustering of a set of data-points is a central problem in machine learning and data-mining [4, 102, 180, 
184, 191, 230, 236, 248, 251, 266, 318, 322], However many interesting clustering objectives, including 
the problems that we consider in this section are NP-hard. 

In this section we present message passing solutions to several well-known clustering objectives 
including K-medians, K-clustering, K-centers and modularity optimization. Message passing has also 
been used within Expectation Maximization to obtain some of the best results in learning stochastic 
block models (a hidden variable model for clustering) [82]. The message passing solution to K-medians 
and its generalization, clustering by shallow trees are proposed by other researchers. However, for 
completeness, we review their factor-graphs in sections 4.1 and 4.2. expressing K-clustering and K- 
centers problems as min-max inference problems on factor-graphs in sections 4.3 and 4.4. 

4.1 K-medians 

Given a symmetric matrix of pairwise distances A G between N data-points, and a number of 

clusters K, K-medians seeks a partitioning of data-points into K clusters, each associated with a cluster 
center, s.t. the sum of distances from each data-point to its cluster center is minimized. 

This problem is NP-hard; however, there exists several approximation algorithms for metric distances 
[15, 58]. Here we present the binary variable factor-graph [112] for a slightly different version of this 
objective, proposed by Frey and Dueck [102]. The simplified form of min-sum BP messages in this 
factor-graph is known as affinity propagation. Here, instead of fixing the number of clusters, K, the 
objective is modified to incorporate each cluster’s cost to become a center. This cost is added to the 
sum of distances between the data-points and their cluster centers and used to decide the number of 
clusters at run-time. 

Let A be the distance matrix of a directed graph G = (V, £), where {i,j) ^ ^ Ajj = oo. Moreover 

let Aj^j denote the willingness of node i to be the center of a cluster. A simple heuristic is to set this 
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value uniformly to mediauj^j Ajj. 

Define x = V (i, j) G <5} as a set of binary variables - one per each directed edge (i, j) - where 
Xi-j G {0,1} indicates whether node i follows node j as its center. Here node i can follow itself as center. 
The following factors define the cost and constraints of affinity propagation: 

• Leader factors ensure that each node selects exactly one cluster center. 

= 1) VzGV 

where as before .) = {(*, j) G T} U is the set of edges leaving node i. 

• Consistency factors as defined by equation (3.8) ensure that if any node i G £{-,j) selects node j 
as its center of cluster, node j also selects itself as the center 

At this point we note that we used both of these factor-types for set-cover and dominating set 
problem in section 3.5. The only addition (except for using a different semiring) is the following factors. 

• Local factors take distances and the willingness to become a center into account 

Hxi:j = 0)) y{i,j) G TU {(i,i) G V} 

where in effect fi:j(xi:j) is equal to Ajj if Xi-,j = 1 and 0 otherwise 

See equation (1.20) for definition of 1(.) in min-sum semiring and note the fact that © = min in this 
case. This means extensions to other semirings {e.g., min-max) need only to consider a different 1(.) 
and use their own © operator. However, direct application of min-max inference to this factor-graph 
is problematic for another reason: since the number of clusters K is not enforced, as soon as node i 
becomes center of a cluster, all nodes j with Ajj < Aj^j can become their own centers without increasing 
the min-max value. In section 4.4, we resolve this issue by enforcing the number of clusters K and use 
inference on min-max semiring to solve the corresponding problem known as K-center problem. 

The complexity of min-sum BP with variable and factor synchronous message update is 0{\£\) as 
each leader and consistency factor allows efficient j)\) and 0{\£{i,.)\) calculation of factor-to- 

variable messages respectively. Moreover, all variable-to-factor messages leaving node i can be calculated 
simultaneously in 0{\£{i, .)|) using variable-synchronous update of equation (2.13). 

4.1.1 Facility location problem 

A closely related problem to K-medians is the facility location problem, where a matrix A G mID|xT 2 | 
specifies the pairwise distance between two parts of bipartite graph Q = {V = (Vi, V 2 ),T). The goal is 
to select a sub-set of facilities D C Vi to minimize the sum of distances of each customer i G V 2 to its 
associated facility in D. 

The uncapacitated version of this problem (no restriction on |D|) can be solved as special K-median 
problem where Aj^j = 00 Vz G V 2 . Message passing solution for min-sum variations of this problem are 
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discussed in [187, 188]. We discuss the min-max facility location as a special case of k-center problem 
in section 4.4. 

4.2 Hierarchical clustering 

By adding a dummy node * and connecting all the cluster centers to this node with the cost 
we can think of the K-median clustering of previous section as finding a minimum cost tree of depth 
two with the dummy node as its root. The shallow trees of Bailly-Bechet et al. [18] generalize this 
notion by allowing more levels of hierarchy. The objective is to find a tree of maximum depth d, that 
minimizes the sum over all its edges. An alternative approach to hierarchical clustering based on nested 
application of affinity propagation is discussed in [111, 289]. 

Previously we presented the binary-variable model for affinity propagation. However, it is possible 
to obtain identical message updates using categorical variables [102]. Here x = {xi,...,xw}, where 
Xi G Xi = {j \ {i,j) G £} selects one of the neighbouring nodes as the center of the cluster for node i. 
In this case we can ignore the leader factors and change the consistency and local factors accordingly. 

The idea in building hierarchies is to allow each node i to follow another node j even if i itself is 
followed by a third node k (i.e., dropping the consistency factors). Also Xi = i is forbidden. However, 
this creates the risk of forming loops. The trick used by Bailly-Bechet et al. [18] is to add an auxiliary 
“depth” variable Zi G {0 ,... ,d} at each node. Here, the depth of the dummy node * is zero and the 
depth factors ensure that if i follows j then Zi = zj + 1: 

U,j{Xi,Zij) = l(^Zi = Zj + 1V Xi ^ j 

4.2.1 Spanning and Steiner trees 

Although finding trees is not a clustering problem, since one could use the same techniques used for 
clustering by shallow trees, we include it in this section. Given a graph Q = (V,T), a penalty 
Wi-.j < 0 per edge (i,j) G £ and a prize tCj > 0 per node i G V, the prize-collecting Steiner tree’s 
objective is to select a connected sub-graph Q' = (V^ C V,£' C £) with maximum sum of prizes 
+ E{i, Wi:j- Since the optimal sub-graph is always a tree, a construction similar to that of 
shallow trees (with min-sum BP) hnds high-quality solutions to depth-limited versions of this problem 
in 0(41^1) [19, 20, 23, 31]. 

The main difference with factor-graph of shallow trees is that here the root node is a pre-determined 
member of V and several tricks are used to hnd best (set of) root(s). However, since the jV'J may be 
smaller than JV], a different dummy node * is introduced, with zero cost of connection to all nodes i G V 
s.t. Xi = * means node i is not a part of the Steiner tree (i G V \ V'). 

Alternatively, if we do not introduce this dummy node such that = V and moreover, set the node 
penalties to zero, the result is a depth limited spanning tree. Bayati et al. [24] show that if the 
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maximum depth is large enough {e.g., d = N) and BP is convergent, it will find the minimum spanning 
tree. 


4.3 K-clustering problem 

Given a symmetric matrix of pairwise distances A G between N data-points, and a number of 

clusters K, K-clustering (a.k.a. min-max clustering) seeks a partitioning of data-points that minimizes 
the maximum distance between all the pairs in the same partition. 

We formulate this problem as min-max inference problem in a factor-graph. Let x = {xi,..., xn} 
with Xi G {1,... ,K} be the set of variables, where Xi = k means, point i belongs to cluster k. The 
Potts factor 


f{ij}ixi,Xj) = m.m[l{xi ^ Xj),Aij) \/l < i < j < N (4.1) 

is equal to Aij if two nodes are in the same cluster and —oo otherwise. It is easy to see that using 
min-max inference on this factor graph, the min-max solution x* (equation (1.26)) defines a clustering 
that minimizes the maximum of all inter-cluster distances. 

Recall that the y-neighborhood graph (definition 3.6.1) for a distance matrix A is the graph 
= (V,T(A, 2 /) = {{i,j) I Aij < y}). 

Claim 4.3.1. The Py-reduction of the min-max clustering factor-graph above is identical to K-clique- 
cover factor-graph of section S.f for Q{A,y). 

Proof. The p^-reduction of the K-clustering factor (equation (4.1)) is 

f{ij}{xi,Xj) = 1 (min (l(xi / Xj),Aij) < y) = l{xi 7 ^ xj V Aij < y) 

Recall that the K-clique-cover factor-graph, defines a pairwise factor between any two nodes i and 
j whenever (i,j) ^ £{A,y), - i.e., whenever Aij < y it does not define a factor. However, A^j < y 
means that the reduced constraint factor above is satisfied and therefore we only need to consider the 
cases where A^j > y. This gives f|jjj(xi,Xj) = l(xj / Xj), which is the K-clique-cover factor between 
two nodes i and j s.t. {i,j) ^ £{A,y). □ 

We use binary-search over sum-product reductions to approximate the min-max solution (see equa¬ 
tion (1.25)). Considering the 0{N‘^K) cost of message passing for K-clique-cover, this gives 0{N‘^K log(A^)) 
0{N‘^K\og{N)) cost for K-clustering, where the log(A^) is the cost of binary search over the set of all 
possible pairwise distance values in A. 
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Figure 4.1 compares the performance of min-max clustering using message passing to that of Furthest 
Point Clustering (FPC) [119] which is 2-optimal when the triangle inequality holds. Note that message 
passing solutions are superior even when using Euclidean distance. 



20 30 

number of clusters 



Figure 4 .I: Min-max clustering of 100 points with varying numbers of clusters (x-axis). Each point is an average 
over 10 random instances. The y-axis is the ratio of the min-max value obtained by sum-product reduction (and 
using perturbed BP with T = 50 to find satisfying assignments ) divided by the min-max value of Furthest Point 
Clustering (FPC). (left) Clustering of random points in 2D Euclidean space. The red line is the lower bound on 
the optimal result based on the worst case guarantee of FPC. (right) Using symmetric random distance matrix 
where Aij = Aj^i ~ 17(0,1). 


4.4 K-center problem 

Given a pairwise distance matrix D G , the K-center problem seeks a partitioning of nodes, with 

one center per partition such that the maximum distance from any node to the center of its partition 
is minimized. 

This problem is known to be NP-hard, even for Euclidean distance matrices [201], however K- 
center has some approximation algorithms that apply when the distance matrix satishes the triangle 
inequality [87, 135] . The method of Dyer and Erieze [87] is very similar to furthest point clustering and 
extends to weighted K-center problem in which the distance from any point to all the other points is 
scaled by its weight. The more general case of asymmetric distances does not allow any constant-factor 
approximation (however o(log(AI))-approximation exists [66, 237]). 

Here we define a factor-graph, whose min-max inference results in the optimal solution for K-center 
problem. Eor this consider the graph Q{V,£) induced by the distance matrix A s.t. V = {1,... ,N} 
and Ajj = 00 {i,j) ^ S. 

Let X = {xpj \ {i,j) G £} be the set of variables, where xpj G {0,1} indicates whether j is the center 
of cluster for i. Dehne the following factors: 
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Figure 4-2: The factor-graph for K-center problem, where the local factors are black squares, the leader factors 
are light grey, consistency factors are white and K-of-N factor is dark grey. 

• Local factors: 


= min (Ajj, = 0)) V(z, j) £ £ 

• Leader, Consistency and at-most-K-of-N factors as defined for the sum-product case of induced 
K-set-cover and K-independent set (section 3.5). Here we need to replace sum-product 1(.) with the 
min-max version (equation (1.20)). 

For variants of this problem such as the capacitated K-center, additional constraints on the maxi¬ 
mum/minimum points in each group may be added as the at-least/at-most K-of-N factors. 

Claim 4.4.1. The Py-reduction of the K-center elustering factor-graph above is identieal to K-set-eover 
faetor-graph of seetion S.5 for Q{A, y). 

Proof. Leader, consistency and at-most-K-of-N factors are identical to the factors used for K-set-cover 
(which is identical to their p^-reduction), where the only difference is that here we have one variable 
per each {i,j) £ £ whereas in K-set-cover for ^(A, y) we have one variable per (i, j) £ £ \ Aij < y. By 
considering the p^-reduction of the local factors 

U.jixpj) = 1 (min (Ajj, l(xjy) = 0)) < y) = l(Ajj < y V xpj = 0) 

we see that we can assume that for Aij > 0, xpj = 0 and we can drop these variables from the factor- 
graph. After this we can also omit the p^-reduction of the local factors as they have no effect. This 
gives us the factor-graph of section 3.5 for set-cover. □ 

Similar to the K-clustering problem, we use the sum-product reduction to find near-optimal solutions 
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Figure 4-3: (left) K-center clustering of 50 random points in a 2D plane with various numbers of clusters 
(x-axis). The y-axis is the ratio of the min-max value obtained by sum-product reduction (T = 500 for perturbed 
BP) over the min-max value of 2-approximation of [87]. (right) Min-max K-facility location formulated as an 
asymmetric K-center problem and solved using message passing. Yellow squares indicate 20 potential facility 
locations and small blue circles indicate 50 customers. The task is to select 5 facilities (red squares) to minimize 
the maximum distance from any customer to a facility. The radius of circles is the min-max value. 


to this problem. The binary search seeks the optimal y € y (where y is the collective range of all the 
factors). Here, since only local factors take values other than Too, the search is over their range, which 
is basically all the values in A. This adds an additional log(A) multiplicative factor to the complexity 
of K-set-cover (which depends on the message update). 

We can significantly reduce the number of variables and the complexity by bounding the distance 
to the center of the cluster y. Given an upper bound y, we may remove all the variables xpj where 
Ajj > y from the factor-graph. Assuming that at most R nodes are at distance Ajj < y from every 
node j, the complexity of min-max inference with synchronous update drops to 0{NR?‘\og{N)). This 
upper bound can be obtained for example by applying approximation algorithms. 

Figure 4.3(left) compares the performance of message-passing and the 2-approximation of [87] when 
triangle inequality holds. The min-max facility location problem can also be formulated as an asym¬ 
metric A-center problem where the distance to all customers is oo and the distance from a facility to 
another facility is —oo (figure 4.3(right)). 


4.5 Modularity maximization 

A widely used objective for clustering (or community mining) is Modularity maximization [229]. How¬ 
ever, exact optimization of Modularity is NP-hard [45]. Modularity is closely related to fully connected 
Potts graphical models [259, 332]. Many have proposed various other heuristics for modularity optimiza¬ 
tion [34, 67, 228, 259, 265]. Here after a brief review of the Potts model in section 4.5.1, we introduce 
a factor-graph representation of this problem that has a large number of factors in section 4.5.2. We 
then use the augmentation technique (section 2.2.2) to incrementally incorporate violated constraints 
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in these models. 


4.5.1 The Potts model 

Let Q = (V,iS) be an undirected graph, with M = \£\, N = |V| and the adjacency matrix B G 
where Bjj / 0 (uj) G £■ Let A be the normalized adjacency matrix where ~ ^Iso 

let denote the normalized degree of node Vi. Graph clustering using modularity 

optimization seeks a partitioning of the nodes into unspecified number of clusters C = {Ci,... ,Ck}, 
maximizing 

= E E (Am - CA^(v)A^(t-)) (4-2) 

CiSC LieCi ^ 

The first term of modularity is proportional to within-cluster edge-weights. The second term is propor¬ 
tional to the expected number of within cluster edge-weights for a null model with the same weighted 
node degrees for each node i. Here the null model is a fully-connected graph. The resolution param¬ 
eter ^ - which is by default set to one - influences the size of communities, where higher resolutions 
motivates a larger number of clusters. 

In the Potts model, each node i G V is associated with a variable Xj G {1,..., Amax}, where 
is an upper bound on the number of clusters. Here, each pair of variables have a pairwise interaction 




Xi = Xj 

0 Xi ^ Xj 


(4.3) 


where min-sum inference on this fully connected factor-graph gives an assignment of each node to a 
cluster so as to maximize the modularity. 


4.5.2 Clique model 

Here, we introduce an alternative factor-graph for modularity optimization. Before introducing our 
factor-graph representation, we suggest a procedure to stochastically approximate the null model using 
a sparse set of interactions. 

We generate a random sparse null model with < aM weighted edges by randomly 

sampling two nodes, each drawn independently from Pr(i) oc and connecting them with a 

weight proportional to BW^l qc y^A^(~)A^(^. If they have been already connected, this weight is 
added to their current weight. We repeat this process aM times, however since some of the edges 
are repeated, the total number of edges in the sparse null model may be less than aM. Finally the 
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normalized edge-weight in the sparse null model is 


A NULL 


del 


■rjNULL 


E tdnull 
k,l ^k,l 

It is easy to see that this generative process in expectation produces the fully connected null model. ^ 


Factor-graph representation 

Here we use the following binary-valued factor-graph formulation. Let x = {xi-j G {0,1} | {i,j) G 
£■ U £;null j ^ q£ variables, and let L denote the cardinality of U The variable Xi-,j 

is equal to one means the corresponding edge, (i,j), is present in the final model, where our goal is 
to define the factor-graph such that the final model consists of cliques. For this, define the factors as 
follows: 

• Local factor for each variable are equal to the difference between the weighted adjacency and the null 
model if an edge is present (f.e., Xi:j = 1) 


h:jixi:j) = min ( l{Xr, 







(4.4) 


By enforcing the formation of cliques, while minimizing the sum of local factors the negative sum of 
local factors evaluates to modularity (equation (4.2)): 

• For each three edges (i, j), {j, k), {i, k) G S < j < k that form a triangle, define a clique 

factor as 


^{i:j^j:k^i:k}(.Xi:j •) Xi-^ f(T^:j Xj-J^ 2) (4.5) 

These factors ensure the formation of cliques - i.e., if two edges that are adjacent to the same node are 
present (i.e., Xi-.i = 1 and Xi± = 1), the third edge in the triangle should also be present (i.e., Xj-k = 1). 
The computational challenge here is the large number of clique constraints. For the fully connected null 
model, we need 0{N^) such factors and even using the sparse null model - assuming a random edge 
probability a.k.a. Erdos-Reny graph - there are 0{^N^) = 0{^) triangles in the graph (recall that 
L = \£U£^^^^\). 

Brandes et al. [45] first introduced an LP formulation with similar form of constraints. However, 
since they include all the constraints from the beginning and the null model is fully connected, their 
method is only applied to small toy problems. 

^The choice of using square root of weighted degrees for both sampling and weighting is to reduce the variance. One 
may also use pure importance sampling {i.e., use the product of weighted degrees for sampling and set the edge-weights in 
the null model uniformly), or uniform sampling of edges, where the edge-weights of the null model are set to the product 
of weighted degrees. 
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Simplified message update and augmentation 

Here, we give technical details as how to simplify the min-sum BP message update for this factor-graph. 
The clique factor is satisfied only if either zero, one, or all three of the variables in its domain are 
non-zero. Therefore, in order to derive message updates from the clique factor ^{r,jj-,k,v.k} 

to variable x^j for a particular value of Xi:j {e.g., Xi-^j = 0), we apply © operator (i.e., minimize) over 
all the valid cases of incoming messages {e.g., when Xi,k and Xj,k in clique factor ^{i.,jj.,k^r,k} are zero). 
This gives the simplified factor-to-variable messages 

P{i:j,j-.k,i:k}—^i:j{^) min{0, Pj:k—^{i:j,j:k,i:k}^ Pi:k—^{i:j,j:k,i:k}} 

P{i-.j,j-.k,i:k}^i:j{^') min{0, Pj:k^{i:j,j:k,i-.k} T Pi:k^{i:j,j:k,i-.k}} (^'6) 

where for Xi:j = 0, the minimization is over three feasible cases (a) Xj,k = x^k = 0, (b) Xj,k = ^^x^k = 0 
and (c) Xj-,k = = 1. For Xi:j = 1, there are two feasible cases (a) Xj-,k = x^k = 0 and (b) Xj.,k = 

Xi.k !• 

Here we work with normalized messages, such that pi_,.j(0) = 0 ^ and use pi^j to denote pi_>.j(l). 
The same applies to the marginal Pi-j, which is a scalar called bias. Here Pi-j > 0 means Pi:j(l) > Pi:j(0) 
and shows a preference for Xi:j = 1. Normalizing clique-factor messages above we get the following form 
of simplified factor-to-variable messages for clique constraints 

P{i.j,j.k,i:k}—^i:j min{0, Pj-,k—^{i:j,j:k,i:k} T Pi:k—^{i:j,j:k,i:k}} (^■'^) 

min{0, Pj-,k—>{i:j,j:k,i:k}^ Pi:k—>{i:j,j:k,i:k}}- 


In order to deal with large number of factors in this factor-graph, we use the augmentation approach 
of section 2.2.2. We start with no clique-factors, and run min-sum BP to obtain a solution (which may 
even be unfeasible). We then find a set of clique constraints that are violated in the current solution and 
augment the factor-graph with factors to enforce these constraints. In order to find violated constraints 
in the current solution, we simply look at pairs of positively fixed edges {x^j = 1 and x^k = 1) 
around each node i and if the third edge of the triangle is not positively fixed {xj.,k = 0), we add 
the corresponding clique factor {^to the factor-graph. See algorithm 3 (in the appendix) for 
details of our message-passing for Modularity maximization. 

Experiments 

We experimented with a set of classic benchmarks’^. Since the optimization criteria is modularity, we 
compared our method only against best known “modularity optimization” heuristics: (a) FastModularity[67], 

^Note that this is different from the standard normalization for min-sum semiring in which mina,. = 0. 

^Obtained form Mark Newman’s website; http://www-personal.uinich.edu/~mejn/netdata/ 
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Figure 4-4- G^ft) Clustering of power network (N = 4941j by message passing. Different clusters have different 
colors and the nodes are scaled by their degree, (right) Clustering of politician blogs network (N = 1490^ by 
message passing and by meta-data - i.e., liberal or conservative. 


Table 4-1: 


Comparison of different modularity optimization methods. 
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Time 
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Time 
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Time 
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Time 

polbooks 
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5.68% 

0.511 

.07 

3624 

13.55% 

0.506 

.04 

0.525 

1.648 

0.467 

0.179 

0.501 

0.643 

0.489 

0.03 

football 

y 

115 

615 

6554 

27.85% 

0.591 

0.41 

5635 

17.12% 

0.594 

0.14 

0.601 

0.87 

0.487 

0.151 

0.548 

0.08 

0.602 

0.019 

wkarate 

n 

34 

78 

562 

12.34% 

0.431 

0 

431 

15.14% 

0.401 

0 

0.444 

0.557 

0.421 

0.095 

0.410 

0.085 

0.443 

0.027 

netscience 

n 

1589 

2742 

NA 

NA 

NA 

NA 

53027 

.0004% 

0.941 

2.01 

0.907 

8.459 

0.889 

0.303 

0.926 

0.154 

0.948 

0.218 

dolphins 

y 

62 

159 

1892 

14.02% 

0.508 

0.01 

1269 

6.50% 

0.521 

0.01 

0.523 

0.728 

0.491 

0.109 

0.495 

0.107 

0.517 

0.011 

lesmis 

n 

77 

254 

2927 

5.14% 

0.531 

0 

1601 

1.7% 

0.534 

0.01 

0.529 

1.31 

0.483 

0.081 

0.472 

0.073 

0.566 

0.011 

celegansneural 

n 

297 

2359 

43957 

16.70% 

0.391 

10.89 

21380 

3.16% 

0.404 

2.82 

0.406 

5.849 

0.278 

0.188 

0.367 

0.12 

0.435 

0.031 

polblogs 

y 

1490 

19090 

NA 

NA 

NA 

NA 

156753 

.14% 

0.411 

32.75 

0.427 

67.674 

0.425 

0.33 

0.427 

0.305 

0.426 

0.099 

karate 

y 

34 

78 

562 

14.32% 

0.355 

0 

423 

17.54% 

0.390 

0 

0.417 

0.531 

0.393 

0.086 

0.380 

0.079 

0.395 

0.009 
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(b) Louvain [34], (c) Spin-glass [259] and (d) Leading eigenvector [228]. 

table 4.1 summarizes our results (see also Figure 4.4). Here for each method and each data-set, 
we report the time (in seconds) and the Modularity of the communities found by each method. The 
table include the results of message passing for both full and sparse null models, where we used a 
constant a = 20 to generate our stochastic sparse null model. For message passing, we also included 
L = \£ + £;null| saving in the cost using augmentation. This column shows the percentage of 

the number of all the constraints considered by the augmentation. For example, the cost of . 14% for 
the polblogs data-set shows that augmentation and sparse null model meant using .0014 times fewer 
clique-factors, compared to the full factor-graph. 

Overall, the results suggest that our method is comparable to state-of-the-art in terms both time 
and quality of clustering. Although, we should note that the number of triangle constraints in large and 
dense graphs increases very quickly, which deteriorates the performance of this approach despite using 
the augmentation. Despite this fact, our results confirm the utility of augmentation by showing that it 
is able to find feasible solutions using a very small portion of the constraints. 


'‘For message passing, we use A = .1, tmax = median{| — A™‘'‘'|}(ij)g£u£NULL and Tmax = 10. Here we do not perform 
any decimation and directly fix the variables based on their bias Pi-.j > 0 Xi-.j = 1. 




Chapter 5 

Permutations 


5.1 Matching and permanent 

The integration and maximization problems over unrestricted permutations define several important 
combinatorial problems. Two notable examples of integration problems are permanent and deter¬ 
minant of a matrix. Determinant of matrix A G is defined as 

N 

det(A) = sign(x)]^ 

rrSiSjv *=1 

where Sn is the set of all permutations of N elements (a.k.a. symmetric group) and Xi G {1,..., N} 
is the index of element in particular permutation x. Here the sign(.) classifies permutations as even 
(sign(j;) = 1) and odd (sign(g;) = — 1), where we can perform an even (odd) permutation by even (odd) 
number of pairwise exchanges. The only difference in definition of permanent is removal of the sign 
function 

N 

perm (A) = Y H 

xSiSjv *=i 

Here, we see that both permanent and determinant are closely related with two easy combinatorial 
problems on graphs - i.e., perfect matching and spanning sub-tree. While calculating the permanent 
for A G {0, is ^P-hard [297], the determinant can be obtained in 0{N^) [116]. 

The matrix-tree theorem states that the number of spanning trees in a graph with adjacency 
matrix A is equal to det(L(z,z)) for an arbitrary 1 < i < N. Here L = A — D is the Laplacian of A, 
where D is a diagonal matrix with degree of each node on the diagonal (i.e., = Yhj L(i, i) 

is the {N — 1) X (N — 1) sub-matrix of L in which row and column i are removed. 

An intermediate step in representing the permutation as a graphical model is to use a bipartite 
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graph Q = {V = (Vi, V 2 ),), where |Vi| = IV 2 I = N and the edge-set E = {{i,j) \ i G Vi,j G V 2 }. 
A perfect matching is a one to one mapping of elements of Vi to V 2 and can be represented using 
the corresponding edges £' <Z £. It is easy to see that any perfect matching £' <Z £ identifies a 
permutation x. Here the maximum weighted matching (a.k.a. assignment problem) problem is to 
find a perfect matching x* = arg^g^^ max j while the bottleneck assignment problem 

seeks x* = arg^g^^^ min max)^^ ^i,xi ■ The factor-graph representation of the next section shows that 
bipartite matching and bottleneck assignment problems correspond to the max-product (min-sum) and 
min-max inference, and computation of permanent corresponds to sum-product inference over the same 
factor-graph. 

Interestingly min-sum and min-max inference in this setting are in P [121, 182] while sum-product 
is in ^P. Indeed the application of max-product BP to find the maximum weighted matching [22] (and 
its generalization to maximum weighted 6-matching [138]) is one of the few cases in which loopy BP 
is guaranteed to be optimal. Although MCMC methods (section 2.6.1) can provide polynomial time 
approximation schemes for permanent [150] (and many other combinatorial integration problems [149]), 
they are found to be slow in practice [139]. This has motivated approximations using deterministic 
variational techniques [62, 308] and in particular BP [139], which is guaranteed to provide a lower 
bound on the permanent [301]. 

5.1.1 Factor-graph and complexity 

Here we review the factor-graph of Bayati et al. [22] for maximum bipartite matching. Given the 
bipartite graph Q = ((Vi,V 2 ),T) and the associated matrix A G with non-negative entries, 

define two sets of variables x = {x* G V 2 | i G Vi} and z = {zj G Vi \ j G V 2 }, where x* = j and Zj = i 
both mean node i is connected to node j. Obviously this representation is redundant and for (i, j) G £ 
a pairwise factor should ensure Xj and Zi are consistent: 

• Consistency factors ensure the consistency of x and z 

f{jj}(xi, Zj) = l((xi = j A Zj =i)\/ {xi 7^ J A Zj 7^ i)) y{i,j) G £ 

• Local factors represent the cost of a matching 

G Vl 


where if i and j are connected in a matching, two local factors f{j}(xj) = A*j and f|j}.(xi) = y^A^ 

account for Ajj. Figure 5.1 shows this factor-graph. 

It is easy to see that the joint form q(x, z) is equal to for any consistent assignment 

© 

to X, z and it is equal to 1 otherwise. Therefore sum-product and max-product (i.e., 
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Figure 5.1: The factor-graph for matching where the local factors are black and consistency factors are white 
squares. 

q(x, z)) produce the permanent and max-weighted matching respectively. 

The cost of each iteration of both max-product and sum-product BP in the factor-graph above is 
0{N‘^). Moreover, for max-product BP, if the optimal solution is unique, BP is guaranteed to converge 
to this solution after 0{^^) iterations, where e is the difference between the cost of first and second 
best matchings and y* is the cost of best matching [22]. 

An alternative is to use a binary variable model in which each edge of the bipartite graph is 
associated with a binary variable and replace the consistency factors with degree constraint to ensure 
that each node is matched to exactly one other node. However this model results in BP updates 
equivalent to the one above (the simplification of updates discussed in [25] is exactly the updates of 
binary variable model). 

For min-max semiring, q(x, z) = max)^;^ ^i,Xi for a consistent assignment to x,z and it evaluates to 

min 

1 = 00 . Therefore min-max inference here seeks an assignment that minimizes the maximum matching 
cost - a.k.a. bottleneck assignment problem. 

5.1.2 Arbitrary graphs 

We can also use message passing to solve max-weighted matching in an arbitrary graph Q = (V,T) 
with adjacency matrix A. Zdeborova and Mezard [331] proposed a factor-graph for the related task of 
counting of the perfect matchings in an arbitrary graph. For this, each edge (i, j) G T is assigned to one 
binary variable xpj G {0,1} and the degree factors on each node restrict the number of non-zero values 
to one 


= 1 (( < 1 ) 
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where S{i,.) is the set of all the edges adjacent to node i. 

Sanghavi [273] consider the problem of maximum weighted b-matching in arbitrary graphs by chang¬ 
ing the degree factor to 


= 1(( Xi-.j) < b) 

and also local factor that takes the weights into account. They show that if the 

solution to the corresponding LP relaxation is integral then BP converges to the optimal solution. 
Moreover, BP does not converge if the LP solution is not integral. 

Matchings in an arbitrary graph Q is also related to the permanent and also (vertex disjoint) cycle 
covers; a set of directed cycles in Q that cover all of its nodes exactly once. The number of such cycle 
covers in an un-weighted graph is equal to perm (A), which is in turn equal to the square of number of 
perfect matchings [222, ch. 13]. 

In fact a directed cycle cover with maximum weight is equivalent to the maximum weighted bipartite 
matching in the construction of the previous section. In section 5.2 below we will use message passing 
to obtain a minimum weighted “undirected” cycle cover and further restrict these covers to obtain a 
minimum weighted cover with a single cycle - i.e., a minimum tour for TSP. 

5.2 Traveling salesman problem 

A Traveling Salesman Problem (TSP) seeks the minimum length tour of N cities that visits each city 
exactly once. TSP is NP-hard and for general distances, no constant factor approximation to this 
problem is possible [238]. The best known exact solver, due to Held and Karp [128], uses dynamic 
programming to reduce the cost of enumerating all orderings from 0{NI) to 0{N‘^2^). The devel¬ 
opment of many (now) standard optimization techniques are closely linked with advances in solving 
TSP. Important examples are simulated annealing [56, 166], mixed integer linear programming [118], 
dynamic programming [28], ant colony optimization [85] and genetic algorithms [115, 120]. 

Since Dantzig et al. [77] manually applied the cutting plane method to 49-city problem, a combination 
of more sophisticated cuts, used with branch-and-bound techniques [21, 233] has produced the state-of- 
the-art TSP-solver, Concorde [13]. Other notable results on very large instances have been reported 
by Lin-Kernigan heuristic [130] that continuously improves a solution by exchanging nodes in the tour. 
For a readable historical background of the state-of-the-art in TSP and its applications, see [12]. 

The search over the optimal tour is a search over all permutations of N cities that contains no 
sub-tours - that is the permutation/tour is constrained such that we dot not return to the starting 
city without visiting all other cities. Producing the permutation with minimum cost that may include 
sub-tours is called the (vertex disjoint) cycle-cover and is in P (see section 5.1.2). 

We provide two approaches to model TSP: section 5.2.1 presents the first approach, which ignores the 
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subtour constraints - i.e., finds cycle covers - and then “augment” the factor-graph with such constraints 
when they become violated. This augmentation process is repeated until a feasible solution is found. 
The second approach, presented in section 5.2.2, is to use the variables that represent the time-step in 
which a node is visited. By having the same number of time-steps as cities, the subtour constraint is 
automatically enforced. This second formulation, which is computationally more expensive, is closely 
related to our factor-graph for sub-graph isomorphism (see section 5.3). This is because one can think 
of the problem of finding a Hamiltonian cycle in Q as finding a sub-graph of G that is isomorphic to 
a loop of size | V|. 

5.2.1 Augmentative approach 

Let G = (V,T) denote a graph of our problem with the positively weighted symmetric adjacency matrix 
A, s.t. Ajj = 0 {i,j) ^ S. The objective is to select a subset of S that identifies shortest tour of N 

cities. Let x = {xij \ {i,j) G 6} he a set of M binary variables {i.e., Xi-,j G {0,1}), one for each edge in 
the graph (i.e., M = |T|) where Xi-,j = 1 means (i,j) is in the tour. We use Xij and Xj:i to refer to the 
same variable. Recall that for each node i, T(i, •) denotes the edges adjacent to i. Define the factors of 
the factor-graph as follows 

• Local factors represent the cost associated with each edge 

h-.jixi:j) = min (l(xi:j = 0), Ajj) V(i,j)GT (5.1) 


where h-.j{xi:j) is either Ajj or zero. 

Any valid tour satisfies the following necessary and sufficient constraints - a.k.a. Held-Karp con¬ 
straints [127]: 

• Degree factors ensure that exactly two edges that are adjacent to each vertex are in the tour 

= 1(( Xi.,j) = 2) (5.2) 

• Subtour factors ensure that there are no short-circuits - i.e., there are no loops that contain strict 

def 

subsets of nodes. To enforce this, for each 5 C V, define S{S ,.) = {(i,j) G £ \ i G S,j ^ S} to he the 
set of edges, with one end in S and the other end in V \ 5. We need to have at least two edges leaving 
each subset S. The following set of factors enforce these constraints 

%(5,.)fe(5,)) = 1(( E V5CV, 5/0 (5.3) 

These three types of factors define a factor-graph, whose minimum energy configuration is the smallest 
tour for TSP. Therefore we can use min-sum inference to obtain the optimal tour. Note that both 
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subtour and degree constraints depend on large number of variables, however, due to sparsity they 
allow efficient linear time calculation of factor-to-variable messages; see equation (5.3). The more 
significant computational challenge is that the complete TSP factor-graph has 0(2^) subtour factors, 
one for each subset of variables. In hgure 5.2 we address this problem using factor-graph augmentation. 

Simplified messages 

In section 2.2.1, we introduced the K-of-N factors for min-sum inference. Both degree and subtour 
factors are different variations of this types of factor. For simplicity we work with normalized message 
= pi_>.j:j(l) — pi_5.j:j(0), which is equivalent to assuming pi_,.j,j(0) = 0 VI, z : j G dl. The same 
notation is used for variable-to-factor message, and marginal belief. As before, we refer to the normalized 
marginal belief, Pij = p(xi:j = 1) — p(xi:j = 0) as bias. 

Recall that a degree constraint for node i (f£:(j,.)) depends on all the variables xt-j for edges {i,j) 
that are adjacent to i. Here we review the factor-to-variable message for min-sum BP 

P£ii,-)^r.j{Xi:j) = min !£(*,.) (^(i,.)) (5.4) 

We show that this message update simplifies to 

P£(L-)^Li(l) = min (pj,fc^£(i_.) I (i,/c) G £:(i, •) \ (b j)) Vz G V (5.5) 

P£:(L-)^Li(0) = ^^Tx{pi:k^£(i^.)+ Pi:i^£{i,.) \ {i,k),{i,l) e S{i,-)\{i,j)) (5.6) 

where for Xi-j = 1, in order to satisfy degree constraint f£:(i,.)(T£:(i_.)), only one other Xi:k for (z,fc) G 
\ {i,j) should be non-zero. On the other hand, we know that messages are normalized such 
that Pi:j-).£’(i,.)(0) = 0, which means they can be ignored in the summation of equation (5.4). For 
Xi-.j = 0, in order to satisfy the constraint factor, two of the adjacent variables should have a non-zero 
value. Therefore we seek two such incoming messages with minimum values. Let min^ A denote the 
smallest value in the set A - i.e., min A = min^ A. We combine the updates above to get a “normalized 
message”, P£(i^.)-^i-.j, which is simply the negative of the second largest incoming message (excluding 
P£{i,-)^i:j) to the degree factor f£(^i^.y. 

2 

P£(i,.)^r.j = P£y^.)^r.j{l) (0) = “ min{pi,fc^^(j_.) | {i,k) G £{i,-) \ (bj)} (5-7) 

Following a similar procedure, factor-to-variable messages for subtour factors is given by 

P£(S,-)^i:j = - max ^0, min{pi,fc^£:( 5 _.) | (z, k) e £ (5, •) \ (z, j)}^ (5.8) 

While we are searching for the minimum incoming message, if we encounter two messages with 
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negative or zero values, we can safely assume P£(s^.)^r,j = 0, and stop the search. This results in 
signihcant speedup in practice. Note that both equation (5.7) and equation (5.8) only need to calculate 
the second smallest incoming message to their corresponding factors, less the current outgoing message. 
In the asynchronous calculation of messages, this minimization should be repeated for each outgoing 
message. However in a factor-synchronous update, by finding three smallest incoming messages to each 
factor, we can calculate all the factor-to-variable messages at the same time. 



□ 
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□ 

iZJ 



iteration 11 


Figure 5.2: The message passing results after each augmentation step for the complete graph of printing board 
instance from [260]. The blue lines in each figure show the selected edges at the end of message passing. The pale 
red lines show the edges with the bias that, although negative (pi-,j < 0), were close to zero. 


Augmentation 

To deal with the exponentially large number of subtour factors, we use the augmentation procedure of 
section 2.2.2. Starting with a factor-graph with no subtour factor, we find a solution using min-sum 
BP. If the solution is feasible (has no subtours) we are done. Otherwise, we can find all subtours in 0{N) 
by finding connected components. We identify all the variables in each subtour as 5 C V and add a 
subtour factor f£’( 5 ,.) to ensure that this constraint is satisfied in the next iteration of augmentation. Here 
to speed up the message passing we reuse the messages from the previous augmentation step. Moreover 
in obtaining x* from min-sum BP marginals p(xj), we ensure that no degree constraint is violated 
(i.e., each node has two neighbouring edges in x*). Figure 5.2 shows iterations of augmentation over a 
print board TSP instance from [260] . Algorithm 4 in the appendix gives details of our message passing 
solution to TSP, which also uses several other minor tricks to speed up the BP message updates. 
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TSPLIB Instances 



Random 2D Euclidean Distance 
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Figure 5.3: Results of message passing for TSP on different benchmark problems. From left to right, the 
plots show: (a) running time, (b) optimality ratio (compared to Concorde), (c) iterations of augmentation and 
(d) number of subtours constraints - all as a function of number of nodes. The optimality is relative to the result 
reported by Concorde. Note that all plots except optimality are log-log plots where a linear trend shows a monomial 
relation (y = ax™') between the values on the x and y axis, where the slope shows the power m. 
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Experiments 

Here we evaluate our method over five benchmark datasets^: (I) TSPLIB, which contains a variety of 
real-world benchmark instances, the majority of which are 2D or 3D Euclidean or geographic distances.^ 
(11) Euclidean distance between random points in 2D. (Ill) Random (symmetric) distance matrices. 
(IV) Hamming distance between random binary vectors with fixed length (20 bits). This appears in 
applications such as data compression [153] and radiation hybrid mapping in genomics [29]. (V) Corre¬ 
lation distance between random vectors with 5 random features (e.g., using TSP for gene co-clustering 
[69]). In producing random points and features as well as random distances (in (HI)), we used uniform 
distribution over [0,1]. 

For each of these cases, we report the (a) run-time, (b) optimality, (c) number of iterations of 
augmentation and (d) number of subtour factors at the final iteration. In all of the experiments, we 
use Concorde [13] with its default settings to obtain the optimal solution.'^ The results in figure 5.3 
(2nd column from left) reports the optimality ratio - i.e., ratio of the tour found by message passing, 
to the optimal tour. To demonstrate the non-triviality of these instance, we also report the optimality 
ratio for two heuristics that have (1 -|- [log2(V)])/2- optimality guarantees for metric instances [154]: 
(a) nearest neighbour heuristic (0(V^)), which incrementally adds to any end of the current path the 
closest city that does not form a loop; (b) greedy algorithm (0(V^ log(V))), which incrementally adds 
a lowest cost edge to the current edge-set, while avoiding subtours. 

All the plots in figure 5.3, except for the second column, are in log-log format. When using log-log 
plot, a linear trend shows a monomial relation between x and y axes - i.e., y = ax™. Here m indicates 
the slope of the line in the plot and the intercept corresponds to log(a). By studying the slope of the 
linear trend in the run-time (left column) in figure 5.3, we observe that, for almost all instances, message 
passing seems to grow with {i.e., slope of ~ 3). Exceptions are TSPLIB instances, which seem to 
pose a greater challenge, and random distance matrices which seem to be easier for message passing. A 
similar trend is suggested by the number of subtour factors and iterations of augmentation, which has 
a slope of ~ 1, suggesting a linear dependence on N. Again the exceptions are TSPLIB instances that 
grow faster than N and random distance matrices that seem to grow sub-linearly. 

Overall, we observe that augmentative message-passing is able to find near-optimal solutions in poly¬ 
nomial time. Although powerful branch-and-cut methods, such as Concorde, are able to exactly solve 
instances with several thousands of variables, their general run-time on random benchmark instances 

^ In all experiments, we used the full graph Q — iy,£), which means each iteration of message passing is 0{N^t), where 
T is the number of subtour factors. All experiments use Tmax = 200 iterations, tmax = medianfAij | i,j} and damping with 
A = .2. We used decimation, and fixed 10% of the remaining variables (out of N) per iteration of decimation. Note that 
here we are only fixing the top N variables with positive bias. The remaining M — N variables are automatically clamped 
to zero. This increases the cost of message passing by an 0{\og{N)) multiplicative factor, however it often produces better 
results. 

^Geographic distance is the distance on the surface of the earth as a large sphere. 

®For many larger instances, Concorde (with default setting and using CPLEX as LP solver) was not able to find the 
optimal solution. Nevertheless we used the upper bound on the optimal produced by Concord in evaluating our method. 
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remains exponential [13, p495], while our approximation on random instances appears to be 0{N^). 

5.2.2 Using pairwise factors 

Here we present an alternative factor-graph for finding permutations without subtours. This formulation 
has O(iV^) factors and therefore the complete factor-graph remains tractable. However in practice, the 
min-sum inference over this factor-graph is not as effective as the augmentation approach. Therefore, 
here we use this factor-graph to solve min-max version of TSP, known as bottleneck TSP through 
sum-product reductions.^ 

Given an asymmetric distance matrix A G , the task in the Bottleneck Traveling Salesman 

Problem (BTSP) is to find a tour of all N points such that the maximum distance between two consec¬ 
utive cities in the tour is minimized [156]. Any constant-factor approximation for arbitrary instances 
of this problem is NP-hard [242]. 

Let X = {xi,..., xn} denote the set of variables where Xi G Aj = {0,..., A — 1} represents the 
time-step at which node i is visited. We assume modular arithmetic (module N) on members of Aj - 
e.g., N = 0 mod N and 1 — 2 = A — 1 mod A. For each pair Xi and xj of variables, define the factor 

= min ^l(lxi - Xj\ > l),max(Ajj, l{xi = Xj - 1)), max(Aj,i, l{xj = Xi - 1))^ (5.9) 

where the tabular form is 



0 

1 


N -2 

N -1 


0 

OO 


— 00 


— oo 

— 00 


1 

A • - 

00 

A • - 


— oo 

— 00 

— 00 

2 

— OO 

A • - 

00 


— oo 

—oo 

—oo 









N -3 

— 00 

—oo 

—oo 


00 

A • • 

— 00 

N -2 

— oo 

—oo 

—oo 


^ 3 ,i 

00 

^i ,3 

N -1 

A - • 

— 00 

— 00 


— oo 

A • • 

00 


Here, 1 = oo on diagonal enteries ensures that Xi ^ Xj. Moreover \xi — Xj\ > 1 means cities i and 

0 

j are not visited consecutively, so this factor has no effect (1 = —oo). However if two cities are visited 
one after the other, depending on whether i was visited before j or vicee-versa, Ajj or Aj^i represent 
the distance between them. This factor can be easily converted to min-sum domains by replacing 

'^A binary-variable formulation with similar symantics is proposed in [ 30 (i]. However the authors applied their message 
passing solution to an instance with only five cities. 
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the identity values — oo,+(X) with —00,0 in the tabular form above. Alternatively we can replace the 
identity, min and max operations in equation (5.9) with their corresponding min-sum functions. 

Here we relate the min-max factor-graph above to a uniform distribution over Hamiltonian cycles 
and use its sum-product reduction to find solutions to Bottleneck TSP. Recall Q{A,y) is a graph in 
which there is a connection between node i and j iff Aij < y. 

Proposition 5.2.1. For any distance matrix A G , the Py-reduction of the BTSP factor-graph 

above, defines a uniform distribution over the (directed) Hamiltonian cycles of Q{A,y). 

Proof. First note that Py defines a uniform distribution over its support as its unnormalized value is 
only zero or one. Here w.l.o.g we distinguish between two Hamiltonian cycles that have a different 
starting point but otherwise represent the same tour. Consider the Py-reduction of the pairwise factor 
of equation (5.9) 


= l{\xi - Xj\ > 1) -b l{xi = Xj-lA Aij < y) (5.10) 

-b l{xi = Xj + l A Aj^i < y) (5.11) 

• Every Hamiltonian cycle over Q{A,y), defines a unique assignments x with Py{x) > 0; Given the 
Hamiltonian cycle H = ho, ^ 2 ,..., h^-i where hi G {1,..., is the i^^ node in the path, for each 
i define Xi = j s.t. hj = i. Now we show that all pairwise factors of equation (5.10) are non-zero 
for X. Consider two variables Xi and Xj. If they are not consecutive in the Hamiltonian cycle then 
^^ij}{xi,Xj) = l(|xi — Xj\ > 1) > 0. Now w.l.o.g. assume i and j are consecutive and xt appears 
before Xj. This means (i,j) G S and therefore Ajj < y, which in turn means j(xj,rcj) = l(xj = 
Xj — 1 A Aij < y) > 0 Since all pairwise factors are non-zero, Py{x) > 0. 

• Every x for which Py{x) > 0, defines a unique Hamiltonian path over Q(A,y): Given assignment 

X, construct H = ho,..., h^^i where hi = j s.t.Xj = i. Now we show that if p(x) > 0, H defines 
a Hamiltonian path. If p(x) > 0, for every two variables Xi and Xi, one of the indicator functions of 
equation (5.10) should evaluate to one. This means that first of all, Xi ^ Xj for i ^ j, which implies H 
is well-defined and hi ^ hj for i ^ j. Since all x* G {0,..., A — 1} values are distinct, for each Xi = s 
there are two variables Xj = s — 1 and Xfc = s -b 1 (recall that we are using modular arithmetic) for which 
the pairwise factor of equation (5.10) is non-zero. This means Ajj < y and Ai^k < V and therefore 
{j,i), {i,k) G S (the edge-set of Q{A,y)). But by dehnition of H, hg = i, hg-i = j and hg+i = k are 
consecutive nodes in H and therefore A is a Hamiltonian path. □ 

This proposition implies that we can use the sum-product reduction of this factor-graph to solve 
Hamiltonian cycle problems. The resulting factor-graph for Hamiltonian cycle problem is an special 
case of our graph-on-graph technique, where the adjacency matrix of one graph is directly used to build 
factors in a second graph. Here, as the tabular form of the factor above suggests, the second graph is a 
simple loop of length N (see section 5.3). 
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Figure 5.4-' The min-max solution (using sum-product reduction) for Bottleneck TSP with different number of 
cities (x-axis) for 2D Euclidean space (left) as well as asymmetric random distance matrices (right) with T = 5000 
for Perturbed BP. The error-bars in all figures show one standard deviation over 10 random instances. 


As expected, due to sparse form of this pairwise factor, we can perform BP updates efficiently. 

Claim 5.2.2. The factor-to-variable BP messages for the sum-product reduction of factor of equa¬ 
tion (5.9) can be obtained in 0{N). 

Proof. The p^-reduction of the min-max factors of equation (5.9) is given by: 


^{i,j}{xi,Xj) = < y) (5.12) 

= l{\xi - Xjl > 1) -L l{xi = Xj - I f\ Aij < y) (5.13) 

-P l{xi = Xj -\-1 A Aj^i < y) (5.14) 

The matrix-form of this factor (depending on the order of Aij, Aj^i, y) takes several forms all of which 
are band-limited. Assuming the variable-to-factor messages are normalized (ie., Pj_^i(xj) = 1) the 
factor-to-variable message is given by 


P{i,j}^i(.Xi) 1 j j (3^1 )T 

l(Aij < y){l -Pj^{ij}{xi - 1))-L 

l(Aj-i < 2/)(l - Pj^{ij}{xi -L 1)) 

Therefore the cost of calculating factor-to-variable message is that of normalizing the variable-to- 
factor message, which is 0{N). □ 

Since there are iV^ pairwise factors, this gives 0{N^) time-complexity for each iteration of sum- 
product BP in solving the Hamiltonian cycle problem (i.e., the sum-product reduction) and 0{N^ log(A^)) 
for bottleneck-TSP, where the log(A^) factor is the cost of binary search (see equation (1.25)). 
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Figure 5.4 reports the average performance of message passing (over 10 instances) as well as a lower 
bound on the optimal min-max value for tours of different length (N). Here we report the results for 
random points in 2D Euclidean space as well as asymmetric random distance matrices. For Euclidean 
problems, the lower bound is the maximum over j of the distance of two closest neighbors to each 
node j. For the asymmetric random distance matrices, the maximum is over all the minimum length 
incoming edges and minimum length outgoing edges for each node. 


5.3 Graph matching problems 

Consider two graphs G = (V,<S) and Q' = with (weighted) adjacency matrices A, A' respec¬ 

tively. Here we enumerate several of the most important problems over permutations based on two 
graphs [72]. Section 5.3.1 introduces the graphical model for the problems of (sub-graph) isomorphism 
and monomorphism. Then we study the message passing solution to graph homomorphism, use it to 
find symmetries in graphs. In section 5.3.5 we introduce a general factor-graph for graph alignment 
based on the previous work of Bradde et al. [44] and show how it can model other problems such as 
quadratic assignment problem and maximum common subgraph. The common idea in all these settings 
is that a “variation” of the adjacency matrix A' of graph G' is used as a pairwise factor over the edges 
(or/and non-edges) of graph G (with adjacency A), defining a Markov network (a factor-graph with 
pairwise factors). 

5.3.1 Sub-graph isomorphism 

The graph isomorphism problem asks whether G = (V,T) and G' = (y',£') are identical up to a 
permutation of nodes (written as ^ = G')- That is, it seeks a one-to-one mapping vr : V —)• V' such 
that {i,j) G T (7r(z), 7r(j)) G S'. With some abuse of notation we also write Tr(G) = G'■ Although, 
there have been polynomial time solutions to special instances of graph isomorphism problem [95], the 
general case has remained elusive. So much so that we do not know whether the problem is NP-complete 
(e.^., see [222]). 

A permutation vr G (recall is the symmetric group) such that 7r(G) = G is called an au¬ 
tomorphism. The automorphisms of G, under composition form a group, called the automorphism 
group Aut(^). The automorphism group also defines a natural notion of symmetry on the nodes of 
graphs. Here, the orbit of each node, is the set of nodes that are mapped to i in any automorphism - 

def 

i.e., orbit(z) = {Tr{i) \ vr G Aut(^)}. The orbits partition the set of nodes V into group of nodes that are 
in a sense symmetric, which makes them a prime candidate for defining symmetry in complex networks 
[198, 321]. 

for one node the minimum length in-coming and out-going edges point to the same city, the second minimum length 
in-coming and out-going edges are also considered in calculating a tighter bound. 
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Sub-graph isomorphism asks whether Q' is isomorphic to a vertex induced subgraph of - 
i.e., it seeks an injective mapping vr : V' —)• V where 

(7r(i),7r(j)) G ii,j)e£' \fi,j eV 

When dealing with sub-graph morphisms, we assume that the mapping is from the smaller graph to the 
larger graph and therefore |V| < |V'|. 

The factor-graph for subgraph isomorphism is defined as follows: We have one variable per i G V, 
where the domain of each variable Xi is V - i.e., x = {xi G | i G V}. The factor-graph has two types 
of pairwise factors: 

• Edge factors: ensure that each edge in ^ = (V,T) is mapped to an edge in Q' = {V',£') 

^ (5.15) 

where assuming the tabular form of this factor for sum-product semiring is simply the adjacency matrix 
A' G oiG'. 

• Non-edge factors: ensure that each non-edge in ^ = (V,T) is mapped to a non-edge in G' = {V',£') 

= l{{xi,Xj) ^ T' Axi / Xj) \/i,j G {i,j) ^ £ (5.16) 

where again for sum-product semiring, the tabular form takes a simple form w.r.t. the binary valued 
adjacency matrix of G' - i-e., 1 — A'. Using sum-product semiring, this fully connected Markov network 
defines a uniform distribution over the (subgraph) isomorphisms from G to G' ■ We could use sum- 
product BP with decimation or perturbed BP to sample individual assignments x* ~ p(x). Here, each 
assignment x = vr is an (injective) mapping from V to V', where x* = j' means node i G V is mapped 
to node j' G V'. 

In particular, for G = G\ the integral is equal to the cardinality of the automorphism group q(0) = 
|Aut(^)| and two nodes i,j G V are in the same orbit iff the have the same marginals - i.e., 

p(xj) = p{xj) 4A i G orbit(j) 

This also suggests a procedure for finding (approximate) symmetries in graphs. We can use sum-product 
BP to find marginals and group the nodes based on the similarity of their marginals. However, the cost 
of message passing in this graphical model is an important barrier in practice. 

Claim 5.3.1. Assuming |V| < \£\ < |Vp and |V'| < \£'\ < and using variable synchronous update, 
the time complexity of each iteration of sum-product BP for subgraph isomorphism is 0(|Vp \£'\). 

Proof. First, we calculate the cost of sending sum-product BP messages through the edge and non-edge 
factors. For the edge factors (A'), we can go through each row of the tabular form of the factor and multi- 
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ply the non-zero entries with corresponding message - he., g{ij}{xi, Xj) = Xj). Then 

we add the values in each column, to obtain the outgoing message - i.e., P{ijj-^j{xj) = Ylxi ^j)- 

This procedure depends on the number of non-zero entries - he., 0{\£'\). The procedure is similar for 
non-edge factors. Therefore the overall cost of sending sum-product BP messages through all 0(|Vp) 
factors is 0{\£'\ |Vp). Using variable synchronous update, calculating all variable-to-factor messages 
takes C>(|Vp|V'|). Using the assumption of the claim, the overall cost is therefore 0{\£'\ |Vp). □ 

However it is possible to improve this complexity by considering sparse mappings. For example, we 
can restrict the domain of each variable Xi to all the nodes j' G V' that have the same degree with node 
i, or furthermore to all the nodes that also have neighbours with same degree as the neighbours of node 
i. 

5.3.2 Subgraph monomorphism and supermorphism 

Sub-graph monomorphism relaxes the constraint of the subgraph isomorphism to 

ii,j)€£ => (7r(i),7r(j)) G T' Vi, j G V (5.17) 

where vr : —)• V has to be injective - i.e., nodes in V' are mapped to distinct nodes in V. However, Q' is 

allowed to cover a “subset” of edges in an induced subgraph of ff. We note here that previous graphical 
models introduce in [43, 44] fox isomorphism in fact define monomorphism. The only difference between 
this factor-graph and that of sub-graph isomorphism is that the non-edge factors are replaced by the 
following uniqueness factors. 

• Uniqueness factors are inverse Potts factors that ensure disconnected nodes are mapped to different 
nodes (- i.e., the mapping is injective) 

/ U) Vi.j G (i,j) (5.18) 

Despite this difference, when Q = Q' - that is when we are interested in automorphisms - or more 
generally when \£\ = \£'\ and |V| = |V'|, the distribution defined by the monomorphism factor-graph is 
identical to that of isomorphism factor-graph. 

Claim 5.3.2. Assuming |V| < \£\ < |Vp and |V'| < \£'\ < \V'\^, using variable synchronous update, 
the time complexity of each iteration of sum-product BP for sub-graph monomorphism 0(|T| \£'\ + 
|V'| |Vp).« 

Proof. The complexity of sending sum-product BP messages through the edge factors is 0(\£'\) (see 
proof of Claim 5.3.1). However, the uniqueness factors are inverse Potts factors and allow 0(|V^|) 

®Bradde et al. [ 44 ] suggest a trick to reduce this time-complexity to (h((|£l| |V|)|Vj), but unfortunately details are 
omitted and we are unable to follow their route. 
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calculation of messages. This means the overall cost of sending messages through all the factors is 
0{\£\ \£'\ + |V'| |Vp). The cost of calculating variable-to-factor messages using variable-synchronous 
update is also 0(|V|^|V'|), which gives the overall complexity of 0{\£\ \£'\ + |V'| |Vp) per BP iteration. 

□ 

So far we have seen two variations of mapping a graph ^ to a subgraph of Q'. In subgraph isomor¬ 
phism, the image of Q strictly agrees with a sub-graph of Q'. In monomorphism, the image is contained 
within a sub-graph. A third possibility is to ask for a mapping such that the image of Q “contains” a 
sub-graph Q'\ 


{i,j)^£ <= (7r(i),7r(j)) E T' Vi, j E V (5.19) 

where again tt : —)• V has to be injective. Due to lack of a better name, we call this mapping 

subgraph supermorphism. 

The factor-graph for sub-graph supermorphism has two types of factors, both of which we have seen 
before: 1) The non-edge factors between non-edges (equation (5.16)) ensure that the image contains 
a sub-graph of Q' while 2) uniqueness factors between the edges (i,j) E £ ensure that the mapping is 
injective. 

Claim 5.3.3. Assuming |V| < \£\ < |Vp and |V'| < \£'\ < |V'P, using variable synchronous update, the 
time complexity of each iteration of sum-product BP for subgraph supermorphism is 0{{\V‘^\ — \£\) \£'\ + 

iv'i ivn 

Proof. The complexity of sending sum-product BP messages through the non-edge factors is 0(\£'\) 
(see proof of Claim 5.3.1) and each uniqueness factor require 0(|V'|) computation. This means the 
overall cost of sending messages through all the factors is 0((|V^| — |T|) \£'\ -\- |V'| |T|). The cost 
of calculating variable-to-factor messages using variable-synchronous update is also C>(|Vp|V'|), which 
gives the overall complexity of C>((|V^| — |T|) \£'\ -|- |V'| |Vp) per BP iteration. □ 

5.3.3 Graph homomorphism 

Graph homomorphism [129] further relaxes the constraint of the sub-graph monomorphism such that 
a homomorphic mapping can map two distinct nodes in V to the same node in However, equa¬ 
tion (5.17) should hold, and therefore if two nodes in V' are adjacent, they should still be mapped to 
distinct and adjacent nodes in V. Compared to other areas of graph theory, the study of rich structure 
and surprising properties of homomorphisms is relatively recent. Study of graph homomorphisms covers 
diverse areas including property testing [9], graph sequences and limits [37, 39, 40, 197] and constraint 
satisfaction problems [223]. Many of these areas are interested in “counting” the number of homomor¬ 
phisms [38] from a graph Q io Q'. As we see shortly, graph homomorphism reduces to many interesting 
CSPs and therefore it is not hard to show that it is NP-complete. Moreover, counting the number of 
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Figure 5 . 5 : (left) The number of endomorphisms for all distinct graphs up to 8 nodes compared to the BP 
integral. Here, larger disks represent graphs with smaller number of nodes, (right) Comparison of normalized 
BP marginals and the exact marginals for endomorphism graphical model. 

homomorphisms is #P-complete [86]. The set of all homomorphisms vr of a graph Q to itself -i.e., its 
endomorphisms - under composition form endomorphism monoid (see definition 1.1.1 for monoid). 
Here, the identity element simply maps each element to itself. Our interest in endomorphism is because 
through the conjecture 5.3.4, we can use it (instead of automorphism) to approximate symmetries in 
graphs. 

The graphical model representation of homomorphism has been investigated in different contexts [50, 
52], but to our knowledge, message passing has not been previously used for counting and finding ho¬ 
momorphisms. The Markov network for homomorphism resembles that of isomorphism and monomor¬ 
phism: The variables are x = {xi \ i € V} where Xi G V', and the factor-graph only contains edge-factors 
of equation (5.15). Assuming jV'J < \£'\ and jV] < \£\, it is easy to see that the complexity of variable- 
synchronous message passing is OdT] JT^), which makes this method very efficient for sparse graphs. 
This graphical model can be extended to represent homomorphism in weighted graphs [38]. For this 
define the edge-factors as 

hid }j e V, i / j. A,,,- > 0 

For small graphs we can exactly count the number of homomorphisms and endomorphisms. To 
evaluate the accuracy of message passing, we compared the BP estimates with the exact number of 
endomorphisms, for all isomorphically distinct graphs with jV] < 9 (i.e., > 13,000 instances); Figure 5.5 
reports this result as well as the accuracy of BP marginals, suggesting that despite existence of short 
loops in these graphs BP estimates are relatively accurate. 
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Reduction to other CSPs 

The relation between homomorphism and other CSPs is well-known [223] . Here, we review this relation 
between the factor-graphs of this section and other CSP factor-graphs we encountered in this thesis. 
Coloring and clique cover: (see example 3.0.1 and section 3.4) K-coloring of Q corresponds to finding 
a homomorphism from Q to JCk, the complete graph of order K. Similarly the homomorphism from the 
complement of Q to IC^ corresponds to K-clique-cover. The relation is also reflected in the corresponding 
factor-graphs as the adjacency matrix for fCx is the inverse Potts model, used in K-coloring. 

Clique problem and independent-set: (see section 3.6) Recall that K-clique problem corresponds 
to finding a clique of size K in Q. This is a special case of sub-graph isomorphism from JCk to G 
which is in this case identical to sub-graph monomorphism and homomorphism from JCk to Q. The 
(sum-product reduction of the) categorical-variable model of section 3.6.2 is a fully connected Markov 
network with edge factors that we used for isomorphism, monomorphism and homomorphism equa¬ 
tion (5.15). Similarly, the K-independent set problem is equivalent to finding homomorphisms from JCk 
to the complement of G. Independent set has an alternative relation with graph homomorphism: any 
homomorphism from ^ to a graph with two connected nodes and a self-loop on one of the nodes defines 
an independent set for G- 

Hamiltonian cycle problem corresponds to sub-graph monomorphism from C|v|, the cycle of length | V|, 
to G- Alternatively, we can formulate it as subgraph supermorphism from G to C|v|. The sum-product 
reduction of our min-max formulation for bottleneck TSP in section 5.2.2 is indeed the factor-graph of 
sub-graph supermorphism from G to C|v| • 

5.3.4 Finding symmetries 

One may characterize a graph G using the “number” of homomorphism from/to other graphs G' ■ This 
characterization is behind the application of graph homomorphism in property testing and definition 
of graph sequences. Let Hom(H,^) be the set of homomorphism from Ti to G - i-e., the set of all 
assignments x where p(x) > 0. Let Hi ,..., Hm be the sequence of all graphs whose number of nodes is 
at most jVj. Then, the Lovasz vector of G which is defined as 

y(^) = (|Hom(?^i,^)|, |Hom(?^ 2 ,^)|, • • •, |Hom(?^M,^)|) (5.20) 

uniquely identifies G up to an isomorphism [196]. 

Here, rather than identifying a particular graph G within the set of all graphs, we are interested 
in identifying a node i G V of a single graph G within the set of all nodes V. Note that here both 
identifications are up to an isomorphism. Our objective is equivalent to finding the orbits of G and our 
approach in finding the orbits using graph homomorphism (rather than isomorphism) is founded on the 
following conjecture. 
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Figure 5.6: The table on the right shows the unnormalized endomorphism marginals for the graph on the left. 
Here row i corresponds to q(a:i) (normalization of which gives p{xi)), and the number at row i and column j is 
the number of times node i is mapped to j in an endomorphism. The total number of endomorphisms for this 
graph is q(0) = 78. Here, the orbits are {2,3}, {5,6}, {1, 7}, {4}. However, q(xi) = q(a; 7 ) = q(a; 4 ) - that is node 
4 maps “to” other nodes with the same freguency as nodes 1 and 7. However, the mappings to node 4 (i.e., the 
4‘^ column of the table) remains different from the mappings to 1 and 7. Here, as predicted by conjecture 5.3.4, 
nodes with similar rows and columns belong to the same orbit. 


Conjecture 5.3.4. Given the uniform distribution over the endomorphisms ofQ: 

p(x) oc l{{xi,Xj)e£) 

(hj)&£ 

the necessary and sufficient condition for i and j to be in the same orbit is 

p(xi) = p{xj) and V/c G V : p{xk = i) = p{xk = j) orbit(i) = orbit(j) 

Note that p{xi = k) is simply the relative frequency of mapping of node i to node k in an endo¬ 
morphism. Therefore this conjecture simply states that for node i and j to be equivalent up to an 
automorphism of it is necessary and sufficient for them to have the same frequency of mapping 
to/from all other nodes of Q. While it is trivial to prove necessity (4=), we found it difficult to prove 
the statement in the other direction. 

Therefore, similar to other related conjectures [163, 204], we turn to computational verification. 
For this, we experimented with distinct graphs with up to 9 nodes (i.e., > 286,000 instances). Here, we 
obtained the exact marginals p{xi) Vi G V and constructed the orbits as suggested by conjecture 5.3.4. 
We then obtained the exact orbits using the software of McKay and Piperno [205]. In all cases two 
partitioning of the nodes to orbits were identical. 

We also note that in conjecture 5.3.4 restricting the condition to p{xi) = p(xj) - i.e., similar 
frequency of mappings “to” other nodes - is not sufficient. Figure 5.6 shows a counter-example for this 
weaker condition. 

Example 5.3.1. In figure 5.7(left) we used the homomorphism marginals to find approximate sym¬ 
metries in a graph with visible symmetries, known as Thomassen graph. Here, after obtaining the 
marginals we used Principle Component Analysis (PCA) with whitening [226] to extract three values 
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Figure 5. 7 ; Coloring of nodes by reducing the dimensionality of marginals to three dimensions of RGB using 
PC A with whitening, (left) The marginals are calculated using sum-product BP. (right) The marginals of the 
Kronecker graph is obtained using Gibbs sampling and annealing. The graph in the middle is the product of two 
graphs on the top and right - i.e., Q = Qi y. Q2- Two nodes in Q are connected iff the corresponding nodes in Qi 
and Q2 are connected. Note that the product of each two colors produces the same color in the product graph and 
similarly colored nodes have similar neighbours. 


for RGB colors. As the figure suggests, this approach is able to identify similar nodes with similar 
colors. 

For dense graphs, message passing is no longer accurate. In figure 5.7(right) we use Gibbs sampling 
with annealing to estimate the endomorphism marginals in Kronecker product [190] of two random 
graphs. Here again, we use PGA to color the nodes. Note that the algorithm is unaware of the product 
format. The choice of the product graph is to easily observe the fact that the product of similarly 
colored nodes produce similarly colored nodes in the product graph. 

An alternative is to use spectral clustering on the matrix of marginals to obtain clusters. In a related 
context Krzakala et al. [180] use the matrix of non-backtracking random-walks to find symmetric clusters 
in stochastic block models. Note that the basic difference with our approach is the matrix used with 
the spectral clustering. Other notable matrices that are used within this context are the Laplacian 
matrix [299], the modularity matrix [228] and the Bethe hessian [272]. The relation between the 
clustering (in its conventional sense) and orbits of a graph is better understood in the extreme case: 
when clusters form isolated cliques, they are identical to orbits. 

5.3.5 Graph alignment 

The graph alignment problem can be seen as the optimization counterpart of the decision problems of 
isomprphism, monomorphism and homomorphism. In the past, different methods have tried to optimize 
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a variety of different objectives [53, 72], In the context of message passing two distinct approaches have 
been used: 1) Bayati et al. [26] propose a factor-graph for “sparse” graph alignment and show that it 
scales to very large instances. Here the term sparsity both refers to the number of edges of Q and Q' 
and also to the restricted possibility of matching nodes of V to V' - ie., each node in V can match only 
a few predetermined nodes in V'. The factor-graph used by the authors resembles the binary version of 
the maximum bipartite matching factor-graph of section 5.1.1. 2) Bradde et al. [44] used the min-sum 
BP to minimize the number of misalignment in a factor-graph similar to that of graph monomorphism 
above. Here we follow their route, with the distinction that we suggest using the min-sum semiring with 
“sub-graph isomorphism” factor-graph and account for different matching costs using several tricks. 

Here we consider a general objective function that evaluates the mapping vr : —)• V U {null}. We 

then show how to optimize this objective using max-sum inference in a factor-graph. 

• Node matching preference for matching node i € V to j' € Vh : V x —)• M. 

• Edge matching preference for matching {i,j) G 6 to G 8': : T x T' —)• M. 

• Node merging preference for mapping nodes i,j G V to the same node k G V: •d{i,j,k') : 
V X V X V' ^ M. 

• Node deletion preference 5{i) : V —>• M, is the penalty for ignoring the node i G V - i.e., mapping 
it to the NULL node. 

• Edge deletion preference is the preference for dropping (i, j) G 8: w{i,j) : T —)• M. 

• Edge insertion preference is the preference for adding an edge G 8' , when it is not matched 

against any edge in T: v{i',j') ; T' —)• M. 

We can define these preferences in such a way that the optimal solution is also a solution to an 
interesting decision problem. 

Example 5.3.2. The optimal alignments with the following parameters reproduce Hom(^,^'); 

• node matching =0 M i GV,f gV 

• edge matching ?((i,j), (/s', Z')) = 1 \/ {i,j) G 8,{k',l') G 8' 

• node merging k') = 0 V i,j G V,k' G V' 

• node deletion S{i) = —oo V i G V 

• edge deletion w{i,j) = —oo V (i,j) G 8 

• edge insertion v{k', I') = 0 V (fc', 8) G 8' 

Alternatively, using positive and uniform node and edge matching preferences, if we set the merge 
cost to —oo and allow node deletion at zero cost, the optimal solution will be the maximum common 
subgraph. In particular, for maximum-edge common subgraph we use 

• node matching =0 M i GV,f gV 

• edge matching o((i,j), (/s', Z')) = 1 \/ {i,j) G 8,{k',l') G 8' 
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• node merging k') = —oo y i,j £ V, k' £ V' 

• node deletion 5{i) = 0 V i G V 

• edge deletion zu{i,j) = 0 V (i,j) G £ 

• edge insertion v{k', I') = 0 V {k', I') £ £' 

Given two weighted adjacency matrices A and A' (for Q and Q' respectively), where Ajj is the 
“flow” between the facilities i and j, while A^,is the “distance” between the locations k' and I', the 
quadratic assignment problem, seeks a one-to-one mapping tt* : V —)■ V' of facilities to locations in 
order to optimize the flow 

TT* = arg^ max ^ 
ij'ev 

Here w.l.o.g. we assume all weights are positive and set the alignment preferences so as to optimize the 
quadratic assignment problem: 

• node matching =0 V f G V,/ G V' 

• edge matching ?((i, j), ik',l')) = AjjA'^,V (f,j) G £, {k\l') £ £' 

• node merging k') = —oo y i,j £ V, k' £ V' 

• node deletion 5{i) = —oo V i G V 

• edge deletion zu{i,j) = —oo y {i, j) £ £ 

• edge insertion v{k\l') = —oo V {k\l') £ £' 

Now we define the factors based on various alignment preferences. The factor-graph has one variable 
per node i £ V: x = {xi | i G V}, where x* G V' U {null}. Here, Xi = NULL corresponds to ignoring 
this node in the mapping. The alignment factor-graph has three type of factors: 

• Local factors: take the node matching preferences and node deletion preference into account: 


h{xi) 


S(i) Xi = NULL 

ip{i, Xi) otherwise 


• Edge factors: are defined for each edge (i,j) G £ and partly account for edge matching, node 
merging and edge deletion: 


h{Xi,Xj) 


0 Xi = NULL V Xj = NULL 

y{i,j,Xi) Xi = Xj 

{xi,Xj)££' 

w{i,j) otherwise 


y{i,j) £ £ 
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• Non-edge factors: are defined for non-existing edge {i,j) ^ S and partly account for 

node merging and edge insertion: 


fi{Xi,Xj) 


0 Xi = NULL V Xj = NULL 

'd{i,j,Xi) Xi = Xj 
v{xi,Xj) {xi,Xj)e£' 

0 otherwise 


Vi,j / i G V, {i,j) i £ 


This factor-graph in its general form is fully connected. The cost of max-sum message-passing 
through each of these factors is 0(|V^| log(V'|)), which means each iteration of variable synchronous 
max-sum BP is 0(|Vp |V^| log(V^|)). However, if the matching candidates are limited (a.k.a. sparse 
alignment), and the graphs Q and Q' are sparse, this cost can be significantly reduced in practice. 


Example 5.3.3. Figure 5.8 shows a matching of E-coli metabolic network against a distorted version, 
where 50% of edges were removed and the same number of random edges were added. Then we generated 
10 matching candidate for each node of the original graph, including the correct match and 9 other 
randomly selected nodes. We used graph alignment with the following preferences to match the original 
graph against the distorted version 

• node matching =0 V i G V, j' G V' 

• edge matching ?((z,j), (/s', Z')) = 1 \/ {i,j) e £,{k',l') e £' 

• node merging k') = —oo V f, j G V, k' G V 

• node deletion 6{i) = —oo V z G V 

• edge deletion w{i,j) = 0 V (i,j) G £ 

• edge insertion v{k',1') = 0 V {k',l') G £' 

We observed that message passing using our factor-graph was able to correctly match all the nodes 
in two graphs. 


^We used T — 100 initial iterations with damping parameter A = .2. After this, we used decimation and fixed p = 1% 
of variables after each T = 50 iterations. The total run-time was less than 5 minutes. 
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Figure 5.8: Matching the E-coli metabolic network against a highly distorted version using message passing. 
Here 50% of \£\ = 4306 edges in the original network (left) are removed and the same number of random edges 
are added, to produce the distorted network (i.e., \£'\ = \£\). Each node had 10 random candidates for matching 
(including the correct choice) and message passing was able to identify the correct matching with 100% accuracy. 



Conclusion 


This thesis studied a general form of inference in graphical models with an emphasis on algebraic 
abstractions. We organized an important subset of these inference problems under an inference hierarchy 
and studied the settings under which distributive law allows efficient approximations in the form of 
message passing. We investigated different methods to improve this approximation in loopy graphs 
using 1) variational formulation and loop correction; 2) survey propagation; 3) hybrid techniques. 
We then studied graphical modelling of combinatorial optimization problems under different modes of 
inference. 

As with any other inference and optimization framework, graphical modeling has its pros and cons. 
The cons of using graphical models for combinatorial optimization are twofold a) implementing message 
passing procedures, when compared to other standard techniques such as using Integer Programming 
solvers, is more complex and time consuming. This is further complicated by b) the fact that there is 
no standard guideline for designing a factor-graph representation, so as to minimize the computational 
complexity or increase the quality of message passing solution. Indeed we used many tricks to efficiently 
approximate the solution to our problems; example include simplification of BP messages through 
alternative normalization, augmentation, variable and factor-synchronous message update, introduction 
of auxiliary variables, using damping and decimation etc. 

On the other hand, when dealing with large scale and difficult optimization problems, one has to 
resort to conceptual and computational decomposition, and graphical modelling and message passing 
techniques are the immediate candidates. Message passing is mass parallelizable, scalable and often 
finds high-quality solutions. By providing factor-graphs for a diverse set of combinatorial problems, 
this thesis also was an attempt to establish the universality of message passing. Of course some of 
these problems better lend themselves to graphical modelling than some others, resulting in better 
computational complexity and quality of results. Table 5.1 summarizes some important information 
about the message passing solutions to combinatorial problems that are proposed or reviewed in this 
thesis. 
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Table 5.1: Summary of message-passing solutions to combinatorial problems. The time complexity is for one 
iteration of message passing. We report different costs for different update schedules for the same problem. See 
the text for references. Here, N is the number of nodes and M is the number of constraints/factors. 


Problem 

Semiring ops. 

Complexity 

Schedule 

relation to others 

Belief Propagation 

* 

o{Ei\Xi\\m\^ + Ei\xm\) 

async. 

- 


not (min, max) 

o{Ei\Xi\\m\+Ei\xm\) 

v-sync. 



* 

oiEi\Xi\\m\+Y.i\Xi\) 

f-sync. 


Perturbed BP 

(+.X) 

o(Ei\Xi\\di\^+ Y.i\xm\) 

async. 

reduces to Gibbs samp. & BP 



o{T.i\Xi\\m\+Ei\xm\) 

v-sync. 


Survey Propagation 

* 

oiY.,2X‘\\di\^ + 

async. 

reduces to BP 

K-satisfiability 

(+.X) 

OiK^M + Y.i\di\'^) 

async. 

- 



O(K^M) 

v-sync. 




0(KM) 

{f,v)-sync. 


K-coloring 

(+,X) 


async. 

/i'-clique-cover 



0(K\£\) 

v-sync. 


K-clique-cover 

(+.X) 

o(KY.,(N -\e(i,.))\^) 

a-sync. 

= to KT-coloring on 


(+.X) 

0(K(N-^ - |£|)) 

v-sync. 


K-dominating-set 

(+.X) 

o(ifiv^ + E,,v|£G-)P + IGA)P) 

async. 

- 

& K-set-cover 


0(KN + \£\) 

f-sync. 


min set-cover 

(min, +) 

0(\£\) 

(f,v)-sync. 

similar to K-median 

K-independent-set 

(+,X) 

0(N^} 

async. 

binary var. model 

& K-clique 


0(KN^) 

f-sync. 

= K-independent-set on 



0(KN + \£\) 

(f,v)-sync. 


K-packing 

(min, max) 

0(\og{N){KN + \£\)) 

(f,v)-sync. 

(-h, x) reduction = K-independent-set 


(min, max) 

0(KN + \£\) 

(f,v)-sync. 

min-max BP 

K-independent-set 

(+,X) 

0(K‘^N^) 

async. 

categorical var. model 

& K-clique 


O(K^N^) 

v-sync. 

= K-independent-set on Q''- 

K-packing 

(min, max) 

0(K^N‘^ log{N)) 

v-sync. 

(-h, x) reduction = K-independent-set 

max independent set 

(max, -I-) 

0(\£\) 

(f,v)-sync. 


& min vertex cover 




= max independent-set 

sphere-packing 

(+,X) 

0(K^2^") 

v-sync. 


(Hamming) 


0(K‘'^n + K^n^y) 

async. 

- 

n: digits 


O(K^n^y) 

v-sync. 


y: min dist. 


O(K^nv) 

(f,v)-sync. 


K-medians 

(min, +) 

0(\£\) 

f-sync. 

a.k.a. affinity propagation 

facility location 

(min, -K) 

Oi\£\) 

f-sync. 


d-depth min span, tree 

(min, -K) 

0(d\£\) 

v-sync. 


prize-coil. Steiner tree 

(min, +) 

0(d\£\) 

v-sync. 


K-clustering 

(min, max) 

0(KN‘^\og(N)) 

v-sync. 

(-h, x) reduction = K’-clique-cover 

K-center 

(min, max) 

Oi\og(N)(KN + \£\)) 

f-sync. 

(-H, x) reduction = K-set-cover 



0(KN + \£\) 

min-max BP 


Modularity max 

(min, +) 



clique model, using augmentation 

Potts model 

max matching 

(min, -K) 

O(N^) 

v-sync 

- 

Sz cycle cover 





bottleneck assignment 

(min, max) 

- 

- 

- 

max 6-matching 

(min, -K) 

O(bN^) 

v-sync 

- 

TSP 

(min, -K) 

O(N^t) or ~ N-’ 

(f.v)-sync 

- 

bottleneck TSP 

(min, max) 

0(N'nog(N)) 

async 

(-h, x) reduction = Hamiltonian cycle 

subgraph isomorphism 

(+,X) 

0(|Vp|£'|) 

v-sync. 

G^G' 

subgraph monomorphism 

(+.X) 

0(|Vp |V'| + |f| |£'|) 

v-sync. 

G^G' 

subgraph supermorphism 

(+,X) 

0((|V2|-|£|) |£'| +|V'| |Vp) 

v-sync. 

G^G' 

homomorphism 

(+,X) 

0(\£\ |f'|) 

v-sync. 

G^G' 

graph alignment 

(max, -I-) 

0(|Vp |V'|log(|V'|)) 

v-sync. 

Q ^ Q' with general costs 

max common sub-graph 




a variation of graph alignment 

quadratic assignment 




a variation of graph alignment ^ 
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Future work 

Due to breadth of the models and problems that we covered in this thesis, our investigation lacks 
the deserved depth in many cases. This is particularly pronounced in the experiments. Moreover, we 
encountered many new questions and possibilities while preparing this thesis. Here, we enumerate some 
of the topics that demand a more in depth investigation in the future work 

• Many of the problems that we discussed also have efficient LP relaxations that some times come 
with approximation guarantees. A comprehensive experimental comparison of message passing 
and LP relaxations for these problems, both in terms of speed and accuracy is highly desirable. 

• Our algebraic approach to inference suggests that all message passing procedures discussed here, 
including survey propagation, are also applicable to the domain of complex numbers. Extensions 
to this domain not only may allow new applications {e.g., using Fourier coefficients as factors) but 
may also produce better solutions to many problems that we have studied here {e.g., in solving 
CSPs). 

• Our study of using graph homomorphism and its application to finding symmetries is a work in 
progress. In particular its relation to other methods such as stochastic block models and spectral 
techniques needs further investigation. 

• Although some preliminary results on Ising model suggested that using sum-product reductions 
for min-max inference performs much better than direct min-max message passing, an extensive 
comparison of these two approaches to min-max inference is missing in our analysis. 

• In section 3.7, we noted that several optimization counterparts to CSPs allow using binary-search 
in addition to a direct optimization approach. While using binary search is more expensive for 
these problems, we do not know which approach will performs better in practice. 
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input : Graph Q = (V,<?) with normalized (weighted)adjacency A, maximum iterations Tmax; 

damping A, threshold Cmax- 
output: A clustering C = {Ci,... ,Ck} of nodes, 
construct the null model 

while TRUE do // the augmentation loop 
e ^ 0, T ^ 0 

while e < Cmax and T < Tma x do // BP loop 
0 

for (i, j) G U 
Pi:i ^ Ptj 

p., ^ (A„- - A---) 

for IG di:j do // update beliefs 

calculate pi^iy using equation (4.7) 

Pi-.j ^ Pi-.j + Pl->-i:j 

end 

e ^ max{e, \pr,j - p'.^l} 

for I G di : j do // update msgs. 

Pi:j—>I ^ Pi:j Pi—>-i:j 

Pi:j^l ^ Api:j_>I + (1 — A)pi:j_^I 

end 

end 

T^T + 1 

end 

for i G V do 

for (i,j),(i,k)€£u£’^^^^do 

if pi:j > 0 and pj:^ > 0 and pi-^ < 0 then add the corresponding clique factor to the 
factor-graph 

end 

end 

if no factor was added then break out of the loop 
else Pi:j_i.i ^ 0 VI, i j G I 

end 

C ^ ConnectedComponents((V, {(i, j) G \ pi-j > 0})) 

Algorithm 3: Message Passing for Modularity Maximization. 
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input : Graph Q = {V,£), weighted (symmetric) adjacency matrix A, maximum iterations 
Tmax, damping A, threshold Cmax- 
output: A subset T C £ of the edges in the tour, 
construct the initial factor-graph 

initialize the messages for degree constraints ■(— OVi G V, j G £{i, •) 

initialize Pij ^ Ajj y{i,j) G £ 

while TRUE do // the augmentation loop 

e ^ 0, T ^ 0 

while e < emax and T < Tmax do //BP loop 
e ^ 0 

for each fi do// including f£:( 5 ,.) > fe{i,-) (updates all the outgoing messages 
from this factor) 

find three lowest values in | i : / G 51} 

for each i : j G I do 

calculate using equation (5.7) 

^ Pi—>i:j Pi—^i:j 
Pl^i-.j Pl->-i:j + Aei^i:j 
Pi:j ^ Pi:j “b I^I—^i:j 
e <- max(e, |ei_^e|) 
end 
end 

r ^ T-p 1 

end 

T ■<— G £ I Pi:j > 0} // respecting degree constraints. 

C ^ ConnectedComponents((V, T)) 
if \C\ = 1 then return T 

else augment the factor-graph with f£:( 5 ,.) V5 G C 
initialize P£(s,-)-^i:j ^ 0 ^ '■ j G £{S, •) 

end 

Algorithm 4: Message Passing for TSP 







